Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tony V E <tvaneerd_at_[hidden]>
Date: Fri, 6 Sep 2019 17:09:44 -0400
On Fri, Sep 6, 2019 at 4:59 PM Thiago Macieira <thiago_at_[hidden]> wrote:

> On Friday, 6 September 2019 13:09:23 PDT Tony V E wrote:
> > On Fri, Sep 6, 2019 at 3:52 PM Thiago Macieira <thiago_at_[hidden]> wrote:
> > > > - if you encode the raw bytes, there might still be cases not covered,
> > > > might need to fall back to UTF8. It sounds like *no* answer will be
> > > > guaranteed to work.
> > >
> > > Which case could there be that the raw bytes fail but UTF-8 supports? I
> > > would think it's the other way around.
> >
> > the case where the encoding changed. Or the raw bytes are being used with
> > the wrong FS API.
>
> The encoding is only needed when converting raw bytes to text. Since there's
> no conversion, the raw bytes are passed through unmodified from payload to
> filesystem API and from filesystem API to the payload.
>
>
If I know which API it was from, and have it available to me. And the
filesystem encoding hasn't changed since then. Niall gives me the
impression that can change. (Or is that only the display encoding that can
change?)

I know you listed all the rules for many scenarios (on Linux do..., on MS
do...) but it seems a bit precarious to me. What happens when a new FS API
comes around, or some other OS, EBCDIC, etc.?

How portable do we want/need these interchange files to be?



> The two files below are simply bags of bytes:
>
> $ for f in /tmp/*.c; do paste - <<<$f <(printf $f | xxd -ps); done
> /tmp/�.c 2f746d702fe92e63
> /tmp/é.c 2f746d702fc3a92e63
>
> The pass-through is possible for all systems where the native FS API is
> 8-bit. This includes Cygwin and WSL.
>
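
A minimal Python sketch of the pass-through idea quoted above, using the two byte strings from the hex dump. The helper names (`is_valid_utf8`, `pass_through`) are hypothetical, and `surrogateescape` is only one possible way to carry undecodable bytes through a text-typed layer without loss; nothing in the thread says P1689 tooling works this way.

```python
# The two payloads from the hex dump above, treated as opaque bags of bytes.
LATIN1_NAME = bytes.fromhex("2f746d702fe92e63")    # /tmp/<0xE9>.c
UTF8_NAME = bytes.fromhex("2f746d702fc3a92e63")    # /tmp/<0xC3 0xA9>.c


def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the payload decodes as strict UTF-8."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pass_through(payload: bytes) -> bytes:
    """Carry the bytes through a str and back without changing them.

    Assumption: invalid bytes are smuggled as lone surrogates
    (the "surrogateescape" error handler), so no charset conversion
    is ever applied to the name.
    """
    as_text = payload.decode("utf-8", errors="surrogateescape")
    return as_text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    for name in (LATIN1_NAME, UTF8_NAME):
        print(name.hex(), is_valid_utf8(name), pass_through(name) == name)
```

Only the second name is valid UTF-8, yet both survive the round trip byte for byte, which is the property the pass-through approach relies on.
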
> For systems where the FS API is natively 16-bit, we have a transformation of
> the 16-bit input to the 8-bit payload. That's CESU-8. That means the file
> C:\temp\é.c is represented by
> 43 3a 5c 74 65 6d 70 5c c3 a9 2e 63
> C : \ t e m p \ 303 251 . c
>
> Today, this applies to any system where _WIN32 is defined.
>
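
The 16-bit to 8-bit transformation described above can be sketched as follows, assuming it means "encode each UTF-16 code unit separately with the UTF-8 bit layout", which is how CESU-8 is defined. The function names are hypothetical. For names inside the Basic Multilingual Plane, such as the example path, the result is identical to plain UTF-8.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Split a str into the UTF-16 code units a 16-bit FS API would see."""
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def code_units_to_payload(units: List[int]) -> bytes:
    """Encode every 16-bit unit on its own, surrogates included (CESU-8 style)."""
    out = bytearray()
    for unit in units:
        # "surrogatepass" lets a surrogate code unit be written as 3 bytes.
        out += chr(unit).encode("utf-8", errors="surrogatepass")
    return bytes(out)


def hex_dump(payload: bytes) -> str:
    """Space-separated lowercase hex, matching the layout used in the mail."""
    return " ".join(f"{byte:02x}" for byte in payload)


if __name__ == "__main__":
    path = "C:\\temp\\\u00e9.c"   # C:\temp\é.c
    print(hex_dump(code_units_to_payload(utf16_code_units(path))))
```

Because each code unit is handled independently, an unpaired surrogate in a Windows file name still produces a payload instead of an encoding error.
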
> Please note that this payload is compatible with all four types of tools
> runnable on Windows (Cygwin, WSL, MSVC and MinGW), so long as they agree on
> the root of the filesystem and how to represent separators. We weren't asked
> to provide feedback on that.
>
> > > > Are there systems where filenames *that developers use* can't be found
> > > > via UTF8?
> > >
> > > The problem is what happens when the locale isn't UTF-8, which is common
> > > enough when LC_ALL=C was set in the environment.
> >
> > And how common is that (besides you :-)
>
> Setting LC_ALL to C is recommended for any tool that needs to parse the output
> of another tool. Searching just qtbase in Qt, I found:
>
> LC_ALL=C $AWK <script goes here>
> LC_ALL=C $$QMAKE_CXX -E -v -xc++ -
> LC_ALL=C readelf -l /bin/ls
>
> The following commits added the LC_ALL=C to the first and third examples,
> after real life failures in deployed code:
>
> https://code.qt.io/cgit/qt/qtbase.git/commit/?id=529a31c967a202458abd126d378a2
>
> https://code.qt.io/cgit/qt/qtbase.git/commit/?id=7839979c07e6f9e1a9c2e038f031f
>
> Note how the AWK example isn't about parsing the output, it's for AWK to parse
> another input correctly.
>
> The middle one was introduced with LC_ALL=C from the first patch submission.
>
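
To illustrate why LC_ALL=C matters for the locale concern raised above, here is a small Python sketch. It assumes a tool that converts file names strictly through the locale's charset, and it models the C locale's charset as ASCII; both are assumptions for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Optional

# /tmp/é.c as UTF-8 bytes, taken from the hex dump earlier in the thread.
UTF8_NAME = bytes.fromhex("2f746d702fc3a92e63")


def child_environment(base: Dict[str, str]) -> Dict[str, str]:
    """Copy an environment and force LC_ALL=C, as the qtbase snippets do."""
    env = dict(base)
    env["LC_ALL"] = "C"
    return env


def decode_with_locale_charset(payload: bytes, charset: str) -> Optional[str]:
    """Strict conversion: None when the bytes do not fit the charset."""
    try:
        return payload.decode(charset)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    env = child_environment({"LANG": "en_US.UTF-8"})
    print(env["LC_ALL"])
    # Assumption: the C locale is modelled as plain ASCII here.
    print(decode_with_locale_charset(UTF8_NAME, "ascii"))
    print(decode_with_locale_charset(UTF8_NAME, "utf-8"))
```

The same bytes that decode cleanly under a UTF-8 locale have no textual interpretation under the modelled C locale, so a tool that insists on converting them would fail on a name that the filesystem itself accepts.
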
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel System Software Products
>
>
>
>

-- 
Be seeing you,
Tony

Received on 2019-09-06 23:10:02