On Fri, Sep 6, 2019 at 5:33 PM Thiago Macieira <thiago@macieira.org> wrote:
On Friday, 6 September 2019 11:41:06 PDT Tony V E wrote:
> Let's say the filename is representable in UTF8 - that's the restriction
> I'm suggesting.

I'm interpreting this in two cases:
 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be decoded
    using UTF-8
 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
   which means I can encode it to 8-bit with UTF-8
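
A sketch of what the case-2 check amounts to (untested, Win32-only, and the
helper name is mine): re-encode the native 16-bit name as UTF-8 with strict
validation, so a name containing e.g. an unpaired surrogate is reported as
unrepresentable instead of being silently mangled. Case 1 is the analogous
UTF-8 well-formedness check on the raw bytes.

    // Untested Win32 sketch: re-encode a native 16-bit file name as UTF-8.
    // WC_ERR_INVALID_CHARS makes the conversion fail on ill-formed UTF-16
    // (e.g. unpaired surrogates) instead of substituting U+FFFD, which is
    // how a tool can tell the name is not representable in UTF-8.
    #include <windows.h>
    #include <optional>
    #include <string>

    std::optional<std::string> nativeNameToUtf8(const std::wstring &name)
    {
        if (name.empty())
            return std::string();
        int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                    name.data(), int(name.size()),
                                    nullptr, 0, nullptr, nullptr);
        if (n <= 0)
            return std::nullopt;        // case 2 fails: not valid UTF-16
        std::string utf8(size_t(n), '\0');
        WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                            name.data(), int(name.size()),
                            utf8.data(), n, nullptr, nullptr);
        return utf8;
    }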

The problem is that when you do this, you exclude the very cases that are
problematic. If we assume that there is no problem, then there is no problem.
If we restrict the usage to the cases where there is no problem, there is no
problem.


I know, it's a great solution! :-)
Niall's reply gave me the impression that even with this restriction, there would still be problems.  Thus my scenario.



> Let's say I'm a tool such as an IDE.  I have a filename, probably read from
> the filesystem in whatever encoding the IDE is currently running in. (ie
> NOT read from the SG15 format)

Sorry, this does not apply.

There are two types of IDEs here:
 a) those, like Qt Creator, that always convert file names from the locale's
codec to an internal format (UTF-16, for Qt Creator); possibly the same for
Java-based ones
 b) those that keep the file names in their native format, be it 8- or 16-bit.
I suspect MSVC is in this category. Your regular Unix text editor (Vim, Emacs,
JOE, etc.) is also here.

For the IDEs (a), any file name that cannot be properly decoded from the FS
"bag of bits" to Unicode text is filesystem corruption. You can't open that
file, you can't delete that file, you can't refer to that file on the command
line of another process or by writing its name into a socket, pipe or another
file's body. When using QDir or QDirIterator, those files are silently
skipped, so you won't even know they were there.
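
A rough sketch of case (a) with Qt (untested; the function name is mine): every
name you get back is already a decoded QString, and, per the above, an entry
whose on-disk name cannot be decoded simply never shows up.

    // Untested Qt sketch of case (a): every name seen here is already a
    // QString, i.e. text decoded to UTF-16; an entry whose on-disk name
    // cannot be decoded never appears in the iteration at all.
    #include <QDirIterator>
    #include <QFile>
    #include <QtDebug>

    void listSources(const QString &dir)
    {
        QDirIterator it(dir, {"*.cpp", "*.h"}, QDir::Files);
        while (it.hasNext()) {
            const QString name = it.next();                   // internal UTF-16 form
            const QByteArray native = QFile::encodeName(name); // back to the FS bag of bits
            qDebug() << name << native.toHex();
        }
    }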

For case (b), there's no encoding involved. The IDE kept the file name in the
same "bag of bits" as it received from the OS.

You're describing case (a), which again implies resolving the problem by
declaring the problem cases to be out of scope.

Well, I was imagining that the IDE kept or converted it into whatever format it wanted, but that it read the name in some native format and has enough info about that native format to convert to UTF-8.  *When* it actually does the conversion (i.e. when reading, or later when writing the SG15 file) doesn't matter (I think).

 
If we ignore the problems,
there is no problem left.

Works in real life too, sometimes. :-)
I'm trying to get a handle on how bad ignoring the problem would be, and whether there are still other problems.


> So let's say I read the filename using the right filesystem encoding, and
> have the name in that encoding. (Or somehow managed to convert, etc)
> Now I want to WRITE the SG15 format.
> I have a filename, and have the filesystem encoding.  Because we said the
> filename is representable in UTF8, I can convert, yes?  (Now, maybe there
> is more than one conversion... normalization...)

Why would you save it in UTF-8, knowing that the other tool that is going to
read it could be operating under a different assumption about what codec to use?

Why not instead save the same bag of bits that you received from the OS, which
you know the OS can use to refer back to the same file? The environment has
not changed during the run of the current application, so it can perform back
and forth translations from the bag of bits to the internal representation,
losslessly.
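
In other words, something like this on POSIX (untested sketch; the helper name
is mine), where d_name is stored byte-for-byte and never decoded:

    // Untested POSIX sketch: collect directory entries as the exact bytes
    // the OS handed back, so the very same bytes can later be written into
    // the SG15 file and fed straight back to open()/stat(), with no
    // decoding step anywhere.
    #include <dirent.h>
    #include <string>
    #include <vector>

    std::vector<std::string> rawNames(const char *dir)
    {
        std::vector<std::string> names;
        if (DIR *d = opendir(dir)) {
            while (const dirent *e = readdir(d))
                names.emplace_back(e->d_name);    // opaque bag of bits, not text
            closedir(d);
        }
        return names;
    }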


How do I know the environment hasn't changed when the other program (the reading one) runs?  The SG15 file was written by one program, then _later_ read by another.
Are these two programs even on the same OS, or do they just have access to the same files?

 
> OK, now I'm another tool, and want to read the SG15 format.
>
> I'm running with some filesystem encoding.  I have a UTF8 filename.  Can I
> convert to filesystem encoding?

No. This is the failure mode: if the file name was stored in UTF-8 and I don't
know what codec the source used to decode the bag of bits to Unicode, I can't
be sure of reproducing the same bag of bits.

If I have the filename in Unicode, and the original filename was representable in Unicode, do I need the same bag of bits, or does every OS have an API for "find this file, here's the Unicode name"?
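
What I have in mind is roughly this (untested; assumes C++17 std::filesystem,
and the helper name is mine): hand the library the UTF-8 name and let it
produce the OS-native form, UTF-16 on Windows, narrow bytes on POSIX, which
only round-trips if the filesystem encoding really is UTF-8.

    // Untested C++17 sketch: construct a path from the Unicode (UTF-8) name
    // and let the library produce the OS-native form -- UTF-16 on Windows,
    // narrow bytes on POSIX (which only round-trips when the filesystem
    // encoding really is UTF-8).
    #include <filesystem>
    #include <fstream>
    #include <string>

    std::ifstream openByUnicodeName(const std::string &utf8Name)
    {
        const std::filesystem::path p = std::filesystem::u8path(utf8Name);
        return std::ifstream(p);      // the path object carries the native form
    }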


> Well if my filesystem is ASCII-only, then maybe not.  But we can't fix
> that.  The user needs to know that their filesystem restricts them to ASCII
> only. Similarly if their filesystem is some other subset of Unicode.

And as I've established, running with LC_ALL=C is a valid scenario, which is
"filesystem is ASCII-only" → "we can't fix that".

> But if the filesystem can represent all of UTF8 (or all of UTF8 that is in
> the filename) into its encoding, and you have the current filesystem
> encoding (maybe different encoding than the UTF8 originally converted
> from), you can do UTF8 -> filesystem?

No, we can't. Even if we limit ourselves to Unicode-capable filesystems, we
have a problem.

Let's say the file is named "测试.cpp" (I used Google Translate for "test"; I
don't know if this is proper Chinese). That means the SG15 file contains a
payload that was U+6D4B U+8BD5 U+002E U+0063 U+0070 U+0070. When the tool
converts from UTF-8 to the filesystem encoding, I can think of two valid
8-bit-capable full-Unicode codecs. The first is UTF-8 itself, which results in
the bytes:
        e6 b5 8b e8 af 95 2e 63 70 70
The other is GB18030, where we have bytes:
        b2 e2 ca d4 2e 63 70 70
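
Something like the following reproduces both byte sequences from the same
Unicode payload (untested sketch; assumes a glibc-style iconv that knows
GB18030):

    // Untested sketch: push the same UTF-8 payload through POSIX iconv to
    // get the GB18030 form, showing that one Unicode file name legitimately
    // maps to two different bags of bits.
    #include <iconv.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        char utf8[] = "\xe6\xb5\x8b\xe8\xaf\x95.cpp";  // 测试.cpp encoded as UTF-8
        char gb[32] = {};
        char *in = utf8, *out = gb;
        size_t inLeft = std::strlen(utf8), outLeft = sizeof(gb);

        iconv_t cd = iconv_open("GB18030", "UTF-8");
        if (cd == (iconv_t)-1)
            return 1;                                  // codec not available here
        iconv(cd, &in, &inLeft, &out, &outLeft);
        iconv_close(cd);

        for (size_t i = 0; i < sizeof(gb) - outLeft; ++i)
            std::printf("%02x ", (unsigned char)gb[i]);
        std::printf("\n");   // expect: b2 e2 ca d4 2e 63 70 70
    }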


As above, does every OS have an API for "find this file, here's a/the Unicode name, and the filename really was Unicode from the start"?
 

> At which step(s) can things go wrong?

All of them, starting from the delineation of the problem space.

Yes, I'm wondering if we can make the problem space smaller, since developers and tools have lots of control over the filenames they use.



--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products


Thanks for your explanations.

--
Be seeing you,
Tony