sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tony V E <tvaneerd_at_[hidden]>
Date: Fri, 6 Sep 2019 18:01:39 -0400

On Fri, Sep 6, 2019 at 5:33 PM Thiago Macieira <thiago_at_[hidden]> wrote:

> On Friday, 6 September 2019 11:41:06 PDT Tony V E wrote:
> > Let's say the filename is representable in UTF8 - that's the restriction
> > I'm suggesting.
>
> I'm interpreting this in two cases:
> 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be
> decoded using UTF-8
> 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
> which means I can encode it to 8-bit with UTF-8
>
> The problem is that when you do this, you exclude the very cases that are
> problematic. If we assume that there is no problem, then there is no
> problem. If we restrict the usage to the cases where there is no problem,
> there is no problem.
>

I know, it's a great solution! :-)
Niall's reply gave me the impression that even with this restriction, there
would still be problems. Thus my scenario.

> > Let's say I'm a tool such as an IDE. I have a filename, probably read
> > from the filesystem in whatever encoding the IDE is currently running
> > in. (ie NOT read from the SG15 format)
>
> Sorry, this does not apply.
>
> There are two types of IDEs here:
> a) those, like Qt Creator, that always convert file names from the
> locale's codec to an internal format (UTF-16, for Qt Creator); possibly
> the same for Java-based ones
> b) those that keep the file names in their native format, be it 8- or
> 16-bit. I suspect MSVC is in this category. Your regular Unix text editor
> (Vim, Emacs, JOE, etc.) is also here.
>
> For the IDEs (a), any file name that cannot be properly decoded from the
> FS "bag of bits" to Unicode text is filesystem corruption. You can't open
> that file, you can't delete that file, you can't refer to that file in the
> command-line to another process or by writing its name in a socket, pipe
> or another file's body. When using QDir or QDirIterator, those files are
> silently skipped, so you won't even know they were there.
>
> For case (b), there's no encoding involved. The IDE kept the file name in
> the same "bag of bits" as it received from the OS.
>
> You're describing case (a), which again implies resolving the problem by
> declaring the problem cases to be out of scope.

Well, I was imagining that the IDE kept or converted it in whatever format
it wanted, but it read it in some native format, and has enough info about
that native format to convert to UTF8. *When* it actually does the
conversion (ie when reading, or later when writing the SG15 file) doesn't
matter (I think).

> If we ignore the problems,
> there is no problem left.
>

Works in real life too, sometimes. :-)
I'm trying to get a handle on how bad ignoring the problem would be, and
whether there are still other problems.

> > So let's say I read the filename using the right filesystem encoding, and
> > have the name in that encoding. (Or somehow managed to convert, etc)
> > Now I want to WRITE the SG15 format.
> > I have a filename, and have the filesystem encoding. Because we said the
> > filename is representable in UTF8, I can convert, yes? (Now, maybe there
> > is more than one conversion... normalization...)
>
> Why would you save it in UTF-8, knowing that the other tool that is going
> to read could be under a different assumption of what codec to use?
>
> Why not instead save the same bag of bits that you received from the OS,
> which you know the OS can use to refer back to the same file? The
> environment has not changed during the run of the current application, so
> it can perform back and forth translations from the bag of bits to the
> internal representation, losslessly.
>
>
How do I know the environment hasn't changed when the other program (the
reading one) runs? The SG15 file was written by one program, then _later_
read by another.
Are these two programs even on the same OS, or do they just have access to
the same files?
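
For reference, the in-process round trip described in the quoted text can
be sketched as follows. This is only an illustration under stated
assumptions: it uses Python's "surrogateescape" error handler as one
possible lossless internal representation, and the file name and the
helper names (to_internal, to_native, is_strict_utf8) are hypothetical.

```python
# Hypothetical on-disk name: Latin-1 bytes for "café.cpp".
# The byte 0xE9 followed by "." is not valid UTF-8.
raw_name = b"caf\xe9.cpp"


def to_internal(raw: bytes) -> str:
    """Decode the bag of bytes, mapping undecodable bytes to lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_native(text: str) -> bytes:
    """Reverse of to_internal: restores the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def is_strict_utf8(raw: bytes) -> bool:
    """True only if the bytes are well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


internal = to_internal(raw_name)
round_tripped = to_native(internal)

# The round trip is lossless within one process...
assert round_tripped == raw_name
# ...but the name is not representable as strict UTF-8, so it falls
# outside the "representable in UTF8" restriction discussed above.
assert not is_strict_utf8(raw_name)
```

The sketch says nothing about a second program that runs later under a
different environment, which is the open question here.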

> > OK, now I'm another tool, and want to read the SG15 format.
> >
> > I'm running with some filesystem encoding. I have a UTF8 filename. Can
> > I convert to filesystem encoding?
>
> No. This is the failure mode: if the file name was stored in UTF-8 and I
> don't know what the source used to decode the bag of bits to Unicode, I
> can't be sure to reproduce the same bag of bits.
>

If I have the filename in Unicode, and the original filename was
Unicode-able, do I need the same bag of bits, or does every OS have an API
for "find this file, here's the Unicode name"?
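
The failure mode in the quoted text can be sketched like this, assuming a
byte-oriented (Unix-style) filesystem and two tools that each have their
own idea of the filesystem encoding. The function names, the file name and
the choice of Latin-1 for the writer are all hypothetical.

```python
def write_interchange(native: bytes, writer_fs_encoding: str) -> bytes:
    """Writer side: native bytes -> Unicode text -> UTF-8 payload."""
    return native.decode(writer_fs_encoding).encode("utf-8")


def read_interchange(payload: bytes, reader_fs_encoding: str) -> bytes:
    """Reader side: UTF-8 payload -> Unicode text -> native bytes."""
    return payload.decode("utf-8").encode(reader_fs_encoding)


# Bytes as stored on disk by a writer running under a Latin-1 locale.
on_disk = b"caf\xe9.cpp"

# The writer knows its own encoding, so the UTF-8 payload is well-formed.
payload = write_interchange(on_disk, "latin-1")

# A reader under the same encoding reproduces the original bytes.
same_env = read_interchange(payload, "latin-1")
assert same_env == on_disk

# A reader under a UTF-8 locale produces different bytes, which name a
# different (probably nonexistent) file on a byte-oriented filesystem.
other_env = read_interchange(payload, "utf-8")
assert other_env != on_disk
```

Both tools agree on the Unicode name here; they disagree only on the
bytes, so whether this matters depends on whether the OS looks files up by
bytes or by Unicode name.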

> > Well if my filesystem is ASCII-only, then maybe not. But we can't fix
> > that. The user needs to know that their filesystem restricts them to
> > ASCII only. Similarly if their filesystem is some other subset of
> > Unicode.
>
> And as I've established, running with LC_ALL=C is a valid scenario, which
> is "filesystem is ASCII-only" → "we can't fix that".
>
> > But if the filesystem can represent all of UTF8 (or all of UTF8 that is
> > in the filename) into its encoding, and you have the current filesystem
> > encoding (maybe different encoding than the UTF8 originally converted
> > from), you can do UTF8 -> filesystem?
>
> No, we can't, even if we limit ourselves to Unicode-capable filesystems,
> we have a problem.
>
> Let's say the file is named "测试.cpp" (I used Google translate for "test",
> I don't know if this is proper Chinese). That means the SG15 file contains
> a payload that was U+6D4B U+8BD5 U+002E U+0063 U+0070 U+0070. When the
> tool will convert from UTF-8 to filesystem, I can think of two valid 8-bit
> capable full-Unicode codecs. The first is UTF-8 itself, which results in
> bytes:
> e6 b5 8b e8 af 95 2e 63 70 70
> The other is GB18030, where we have bytes:
> b2 e2 ca d4 2e 63 70 70
>
>
As above, does every OS have an API for "find this file, here's a/the
Unicode name, and the filename really was Unicode from the start"?
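
The two byte sequences in the quoted example can be reproduced with a
short script. This is just a check of the example, using Python's built-in
"utf-8" and "gb18030" codecs; the variable names are arbitrary.

```python
# "测试.cpp" as code points: U+6D4B U+8BD5 U+002E U+0063 U+0070 U+0070
name = "\u6d4b\u8bd5.cpp"

# The same Unicode text under two full-Unicode 8-bit codecs.
utf8_bytes = name.encode("utf-8")
gb18030_bytes = name.encode("gb18030")

print(utf8_bytes.hex(" "))     # e6 b5 8b e8 af 95 2e 63 70 70
print(gb18030_bytes.hex(" "))  # b2 e2 ca d4 2e 63 70 70

# Same name, different bags of bytes: a byte-oriented filesystem would
# treat these as two different files.
assert utf8_bytes != gb18030_bytes
```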

> At which step(s) can things go wrong?
>
> All of them, starting from the delineation of the problem space.
>

Yes, I'm wondering if we can make the problem space smaller, since
developers and tools have lots of control over the filenames they use.

> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel System Software Products
>
>
Thanks for your explanations.

-- 
Be seeing you,
Tony

Received on 2019-09-07 00:01:57