Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 20:28:50 -0700
On Friday, 6 September 2019 19:17:00 PDT Lyberta wrote:
> Thiago Macieira:
> > [*] The only remaining issue is the perfectly valid case of setting
> > LC_ALL=C in the environment for reading other tools' output. I would
> > recommend just ignoring that.
>
> I think if the machine-readable output depends on locale, the author of
> the program seriously messed up.

Oh, I agree with you. The problem is that the standard C library (as extended
by POSIX) does not provide the API to make that happen *and* support
internationalisation. And that's assuming the tool even has a "machine
readable" format in the first place. In the Unix tradition, you just scrape
the output of tools.

$ du -sh
1,8G .

Note the comma instead of dot?

$ find -ls
  4719721 4 drwxr-xr-x 3 tjmaciei users 4096 set 6 19:39 .
  4722472 4 -rw-r--r-- 1 tjmaciei users 2927 set 6 19:39 ./generate.pl
  4719722 4 drwxr-xr-x 2 tjmaciei users 4096 jun 18 17:18 ./packages
  4719723 2228 -rw-r--r-- 1 tjmaciei users 2280402 fev 8 2019 ./packages/freedesktop.org.xml
  4742041 236 -rw-r--r-- 1 tjmaciei users 239063 fev 8 2019 ./packages/freedesktop.org.xml.zst
  4721630 4 -rw-r--r-- 1 tjmaciei users 2391 set 6 19:39 ./generate.bat
  4722858 4 -rw-r--r-- 1 tjmaciei users 1739 set 6 19:39 ./hexdump.ps1

Note the month names in Portuguese (in a date format that is neither valid
Portuguese nor English, because no one in their sane mind would put day
between month and year).
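To make the decimal-comma point concrete, here is a minimal C++ sketch of
keeping human-facing and machine-facing number formatting apart by imbuing a
locale per stream instead of relying on the process-wide one. DecimalComma,
format_with, format_for_user and format_for_machine are hypothetical names
invented for this illustration, and the facet only imitates a decimal-comma
locale so that the example does not depend on which locales are installed.
It is not how du or find work.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet imitating a locale that uses a decimal comma, so the
// sketch does not depend on any particular locale being installed.
struct DecimalComma : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Formats `value` with the conventions of the locale passed in, leaving the
// global locale untouched.
std::string format_with(const std::locale& loc, double value) {
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

// Human-facing output: follows the (simulated) user locale.
std::string format_for_user(double value) {
    static const std::locale user(std::locale::classic(), new DecimalComma);
    return format_with(user, value);
}

// Machine-facing output: always the classic "C" conventions.
std::string format_for_machine(double value) {
    return format_with(std::locale::classic(), value);
}
```

With these helpers, format_for_user(1.8) yields "1,8" and
format_for_machine(1.8) yields "1.8" in the same process.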

> Corentin:
> > Supporting non displayable characters in build tools has no value. For
> > anyone. "Someone might do that" is the reason we don't have nice things.
>
> 100% agree. If the user has non-UTF paths, the job of the build system
> is to show message "Mate, you shot yourself in the foot. Fix your file
> system." It's that simple.

This is the philosophy that Qt has adopted too: file names that cannot be
decoded by the locale codec are filesystem corruption. The build tools do not
need to support them.

C++ might have to, but Niall has that well in hand.

> > int main(int argc, char *argv[])
>
> So,
>
> int main(std::span<std::unicode::text_view> args)
>
> then?

I worked with Erich Keane to come up with a solution for this. I think we even
had a discussion in one of the mailing lists.

But the input is not Unicode, it's file paths. On Unix, it is possible to pass
binary input on the command line. With some effort, you can even pass NULs to
specially crafted receiver applications. The std::filesystem API appears to
have a way to retrieve the native raw format, which some applications may
need.
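A small sketch of that raw access, assuming a POSIX system where
std::filesystem::path::value_type is char and the bytes from argv are stored
without conversion. raw_native is a hypothetical helper name used only for
this illustration.

```cpp
#include <filesystem>

// Returns the path in its native representation. On POSIX systems this is a
// std::string holding the same bytes that were passed in, whether or not
// they form valid text in any encoding.
std::filesystem::path::string_type raw_native(const char* arg) {
    return std::filesystem::path(arg).native();
}
```

For example, raw_native("\xe9.ui") gives back the single byte 0xE9 followed by
".ui", which is what a tool would have to hand to open() to find the file.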

Qt doesn't care about those. QCoreApplication::arguments() is a list of
QStrings, decoded using QFile::decodeName. Binary data will be silently
corrupted:

$ strace uic $'\xe9.ui' |& grep -aF .ui
execve("/home/tjmaciei/bin/uic", ["uic", "\351.ui"], 0x7fffa282edc8 /* 118 vars */) = 0
execve("/home/tjmaciei/obj/qt/qt5/qtbase/bin/uic", ["/home/tjmaciei/obj/qt/qt5/qtbase"..., "\351.ui"], 0x7ffc30f6ad60 /* 118 vars */) = 0
openat(AT_FDCWD, "\357\277\275.ui", O_RDONLY|O_CLOEXEC) = -1 ENOENT (Arquivo ou diretório inexistente)
write(2, "File '\357\277\275.ui' is not valid\n", 27File '�.ui' is not valid

Oh, we got errors in Portuguese. Let me set LC_ALL=C:

$ LC_ALL=C strace uic $'\xe9.ui' |& grep -aF .ui
execve("/home/tjmaciei/bin/uic", ["uic", "\351.ui"], 0x7ffd0cd9f448 /* 119 vars */) = 0
execve("/home/tjmaciei/obj/qt/qt5/qtbase/bin/uic", ["/home/tjmaciei/obj/qt/qt5/qtbase"..., "\351.ui"], 0x7ffe3a738e00 /* 119 vars */) = 0
openat(AT_FDCWD, "?.ui", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "File '?.ui' is not valid\n", 25File '?.ui' is not valid
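To show where the \357\277\275 in the first transcript comes from, here is a
rough sketch of a lossy decode: well-formed UTF-8 is kept and every offending
byte is replaced by U+FFFD, whose UTF-8 encoding is EF BF BD (octal 357 277
275). This is not Qt's actual implementation; utf8_sequence_length and
replace_invalid_utf8 are hypothetical names made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the length (1 to 4) of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes there are not well-formed UTF-8.
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    // True if s[k] exists and lies in [lo, hi]; the default range is that of
    // an ordinary continuation byte.
    auto cont = [&](std::size_t k, unsigned lo = 0x80, unsigned hi = 0xBF) {
        return k < s.size() && byte(k) >= lo && byte(k) <= hi;
    };
    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF) return cont(i + 1) ? 2 : 0;
    if (b0 == 0xE0) return (cont(i + 1, 0xA0, 0xBF) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xED) return (cont(i + 1, 0x80, 0x9F) && cont(i + 2)) ? 3 : 0;
    if (b0 >= 0xE1 && b0 <= 0xEF) return (cont(i + 1) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xF0)
        return (cont(i + 1, 0x90, 0xBF) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 >= 0xF1 && b0 <= 0xF3)
        return (cont(i + 1) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 == 0xF4)
        return (cont(i + 1, 0x80, 0x8F) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    return 0;  // stray continuation byte, overlong lead, or out-of-range lead
}

// Copies well-formed sequences and replaces each offending byte with U+FFFD.
std::string replace_invalid_utf8(std::string_view raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size();) {
        const std::size_t n = utf8_sequence_length(raw, i);
        if (n == 0) {
            out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
            ++i;
        } else {
            out.append(raw.data() + i, n);
            i += n;
        }
    }
    return out;
}
```

Feeding it the argument from the transcript, replace_invalid_utf8("\xe9.ui")
returns the bytes EF BF BD followed by ".ui", a name that no longer matches
the file on disk.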

Right now I'm tempted to submit a patch that makes Qt assume that locale "C"
is actually "C.UTF-8".

And since we're on the subject of strace, see how it is not parseable without
LC_ALL=C:

$ strace -c true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24,09    0,000191          19        10         8 openat
 22,19    0,000176          25         7           mmap
 13,75    0,000109          13         8         7 stat
 11,85    0,000094          23         4           mprotect
  6,56    0,000052          52         1           munmap
  6,18    0,000049          49         1         1 access
  5,30    0,000042          42         1           brk
  2,65    0,000021          10         2           fstat
  2,52    0,000020          10         2           close
  1,89    0,000015          15         1           execve
  1,64    0,000013          13         1           read
  1,39    0,000011          11         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0,000793                    39        16 total

Note the commas for the percentages and times (except for the 100.00!).
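For a scraper that cannot control the environment of the tool it runs, one
possible workaround is to accept either separator in a numeric field. The
sketch below assumes the field carries no thousands separators, and
parse_decimal_field is a hypothetical helper name, not part of strace or of
any library.

```cpp
#include <locale>
#include <optional>
#include <sstream>
#include <string>

// Parses a numeric field that may use either '.' or ',' as the decimal
// separator. Returns std::nullopt if the field is not a single number.
std::optional<double> parse_decimal_field(std::string field) {
    for (char& c : field) {
        if (c == ',') c = '.';  // normalise the decimal comma
    }
    std::istringstream in(field);
    in.imbue(std::locale::classic());  // parse with "C" conventions only
    double value = 0;
    if (!(in >> value)) return std::nullopt;
    char extra;
    if (in >> extra) return std::nullopt;  // trailing junk, e.g. "1.234.5"
    return value;
}
```

Both "24,09" and "100.00" from the table above parse with this helper, while a
field such as "1,234.5" is rejected instead of being silently misread.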

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-07 05:28:55