Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 13:59:37 -0700
On Friday, 6 September 2019 13:09:23 PDT Tony V E wrote:
> On Fri, Sep 6, 2019 at 3:52 PM Thiago Macieira <thiago_at_[hidden]> wrote:
> > > - if you encode the raw bytes, there might still be cases not covered,
> > > might need to fall back to UTF8. It sounds like *no* answer will be
> > > guaranteed to work.
> >
> > Which case could there be that the raw bytes fail but UTF-8 supports? I
> > would think it's the other way around.
>
> the case where the encoding changed. Or the raw bytes are being used with
> the wrong FS API.

The encoding is only needed when converting raw bytes to text. Since there's
no conversion, the raw bytes are passed through unmodified from payload to
filesystem API and from filesystem API to the payload.
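
A minimal Python sketch of that pass-through, assuming a POSIX system whose filesystem accepts arbitrary byte names (typical on Linux). The helper name `roundtrip_through_fs` is made up for illustration. The name is never decoded to text on the way in or out:

```python
import os
import tempfile


def roundtrip_through_fs(name: bytes) -> bytes:
    """Create a file named by raw bytes and read the name back as raw bytes."""
    with tempfile.TemporaryDirectory() as d:
        directory = os.fsencode(d)
        path = os.path.join(directory, name)
        # A bytes path is handed to the OS without any text conversion.
        with open(path, "wb"):
            pass
        # Listing with a bytes argument returns undecoded bytes names.
        (found,) = os.listdir(directory)
        return found


# Neither name needs to be valid in any particular text encoding.
for payload_name in (b"\xe9.c", b"\xc3\xa9.c"):
    print(payload_name.hex(), roundtrip_through_fs(payload_name) == payload_name)
```

Because no decode or encode step is involved, the locale in effect does not matter for this path.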

The two files below are simply bags of bytes:

$ for f in /tmp/*.c; do paste - <<<"$f" <(printf %s "$f" | xxd -ps); done
/tmp/�.c 2f746d702fe92e63
/tmp/é.c 2f746d702fc3a92e63
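
The same two names can be inspected in Python as a rough illustration. The byte strings are taken from the hex dumps above; the helper `is_valid_utf8` is a hypothetical name. Only the second name decodes as UTF-8, yet both are equally usable as byte strings:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Report whether the byte string happens to be well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_name = b"/tmp/\xe9.c"      # the name the terminal could not display
utf8_name = b"/tmp/\xc3\xa9.c"    # the name shown as é

for name in (latin1_name, utf8_name):
    print(name.hex(), is_valid_utf8(name))
```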

The pass-through is possible for all systems where the native FS API is 8-bit.
This includes Cygwin and WSL.

For systems where the FS API is natively 16-bit, we have a transformation of
the 16-bit input to the 8-bit payload. That's CESU-8. That means the file
C:\temp\é.c is represented by
  43 3a 5c 74 65 6d 70 5c  c3  a9 2e 63
   C  :  \  t  e  m  p  \ 303 251  .  c
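
A sketch of that 16-bit to 8-bit transformation, assuming CESU-8 as described in Unicode Technical Report #26: each 16-bit code unit is encoded on its own, so a surrogate pair becomes two 3-byte sequences rather than one 4-byte UTF-8 sequence. The function names are illustrative only:

```python
def utf16_units(text):
    """Split a string into its UTF-16 code units (what a 16-bit FS API sees)."""
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def cesu8_from_units(units):
    """Encode each 16-bit unit independently, surrogates included."""
    out = bytearray()
    for unit in units:
        # "surrogatepass" lets a lone or paired surrogate be encoded as 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


payload = cesu8_from_units(utf16_units("C:\\temp\\\u00e9.c"))
print(payload.hex(" "))
```

For names containing only BMP characters, as here, the result is byte-for-byte identical to UTF-8. The two differ only for characters outside the BMP and for unpaired surrogates, which a 16-bit filesystem API may accept.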

Today, this applies to any system where _WIN32 is defined.

Please note that this payload is compatible with all four types of tools
runnable on Windows (Cygwin, WSL, MSVC and MinGW), so long as they agree on
the root of the filesystem and how to represent separators. We weren't asked
to provide feedback on that.

> > > Are there systems where filenames *that developers use* can't be found
> > > via UTF8?
> >
> > The problem is what happens when the locale isn't UTF-8, which is common
> > enough when LC_ALL=C was set in the environment.
>
> And how common is that (besides you :-)

Setting LC_ALL to C is recommended for any tool that needs to parse the output
of another tool. Searching just qtbase in Qt, I found:

 LC_ALL=C $AWK <script goes here>
 LC_ALL=C $$QMAKE_CXX -E -v -xc++ -
 LC_ALL=C readelf -l /bin/ls

The following commits added LC_ALL=C to the first and third examples,
after real-life failures in deployed code:
https://code.qt.io/cgit/qt/qtbase.git/commit/?id=529a31c967a202458abd126d378a2
https://code.qt.io/cgit/qt/qtbase.git/commit/?id=7839979c07e6f9e1a9c2e038f031f

Note that the AWK example isn't about parsing another tool's output; it's
there so that AWK parses its own input correctly.

The middle one was introduced with LC_ALL=C from the first patch submission.
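
For illustration, the same pattern from a Python driver might look like the sketch below. The helper name `run_in_c_locale` is hypothetical, and the child command is only a stand-in for a real tool such as readelf:

```python
import os
import subprocess
import sys


def run_in_c_locale(cmd):
    """Run cmd with LC_ALL=C so its output is not localized, and return stdout."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(cmd, env=env, capture_output=True, check=True)
    # Output under the C locale is expected to be ASCII; replace anything else.
    return result.stdout.decode("ascii", errors="replace")


# Stand-in child process that reports the locale setting it was given.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
print(run_in_c_locale(child).strip())
```

Any tool launched this way sees a non-UTF-8 locale, which is why a filename interchange format that depends on the locale encoding would be fragile in build systems.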

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-06 22:59:42