Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 14:33:00 -0700
On Friday, 6 September 2019 11:41:06 PDT Tony V E wrote:
> Let's say the filename is representable in UTF8 - that's the restriction
> I'm suggesting.

I'm interpreting this as two cases:
 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be decoded
    using UTF-8
 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
    which means I can encode it to 8-bit with UTF-8
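As a rough illustration of the two cases, here is a minimal Python sketch that checks whether a raw name decodes under strict rules. The helper name decodes_as and the sample names are assumptions made up for this example; the 16-bit case is shown as little-endian bytes rather than real Windows API output.

```python
def decodes_as(raw: bytes, codec: str) -> bool:
    """Return True if `raw` is a valid sequence in the given codec."""
    try:
        raw.decode(codec)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Case 1 (Unix): 8-bit names as handed back by the filesystem API.
utf8_name = "\u6d4b\u8bd5.cpp".encode("utf-8")
gb_name = bytes.fromhex("b2e2cad42e637070")  # a GB18030-encoded name

# Case 2 (Windows): 16-bit names, represented here as UTF-16LE bytes.
utf16_name = "test.cpp".encode("utf-16-le")
lone_surrogate = b"\x00\xd8" + ".cpp".encode("utf-16-le")  # unpaired U+D800

print(decodes_as(utf8_name, "utf-8"))            # True
print(decodes_as(gb_name, "utf-8"))              # False
print(decodes_as(utf16_name, "utf-16-le"))       # True
print(decodes_as(lone_surrogate, "utf-16-le"))   # False
```

The names that return False are exactly the ones the restriction leaves out.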

The problem is that when you do this, you exclude the very cases that are
problematic. If we assume that there is no problem, then there is no problem.
If we restrict the usage to the cases where there is no problem, there is no
problem.

> Let's say I'm a tool such as an IDE. I have a filename, probably read from
> the filesystem in whatever encoding the IDE is currently running in. (ie
> NOT read from the SG15 format)

Sorry, this does not apply.

There are two types of IDEs here:
 a) those, like Qt Creator, that always convert file names from the locale's
    codec to an internal format (UTF-16, for Qt Creator); possibly the same
    for Java-based ones
 b) those that keep the file names in their native format, be it 8- or
    16-bit. I suspect MSVC is in this category. Your regular Unix text editor
    (Vim, Emacs, JOE, etc.) is also here.

For the IDEs (a), any file name that cannot be properly decoded from the FS
"bag of bits" to Unicode text is filesystem corruption. You can't open that
file, you can't delete that file, and you can't refer to that file on the
command line to another process or by writing its name to a socket, pipe or
another file's body. When using QDir or QDirIterator, those files are silently
skipped, so you won't even know they were there.

For case (b), there's no encoding involved. The IDE kept the file name in the
same "bag of bits" as it received from the OS.

You're describing case (a), which again implies resolving the problem by
declaring the problem cases to be out of scope. If we ignore the problems,
there is no problem left.

> So let's say I read the filename using the right filesystem encoding, and
> have the name in that encoding. (Or somehow managed to convert, etc)
> Now I want to WRITE the SG15 format.
> I have a filename, and have the filesystem encoding. Because we said the
> filename is representable in UTF8, I can convert, yes? (Now, maybe there
> is more than one conversion... normalization...)

Why would you save it in UTF-8, knowing that the other tool that is going to
read it may be working under a different assumption about which codec to use?

Why not instead save the same bag of bits that you received from the OS, which
you know the OS can use to refer back to the same file? The environment has
not changed during the run of the current application, so it can perform back
and forth translations from the bag of bits to the internal representation,
losslessly.
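One way to picture a lossless back-and-forth translation is Python's "surrogateescape" error handler (PEP 383), which maps undecodable bytes to lone surrogates and back. This is only a sketch of the idea, not a proposal for the interchange format; the helper names to_internal, to_raw and is_plain_utf8_text are hypothetical.

```python
def to_internal(raw_name: bytes, codec: str = "utf-8") -> str:
    """Decode a raw name, smuggling undecodable bytes as lone surrogates."""
    return raw_name.decode(codec, errors="surrogateescape")


def to_raw(internal: str, codec: str = "utf-8") -> bytes:
    """Reverse to_internal(), restoring the original bytes exactly."""
    return internal.encode(codec, errors="surrogateescape")


def is_plain_utf8_text(internal: str) -> bool:
    """Return True if the internal string can be written as strict UTF-8."""
    try:
        internal.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = bytes.fromhex("b2e2cad42e637070")  # not valid UTF-8
internal = to_internal(raw)

print(to_raw(internal) == raw)        # True: the bag of bits survives
print(is_plain_utf8_text(internal))   # False: it is not storable as UTF-8 text
```

The round trip only holds while the same codec is used in both directions, which is the "environment has not changed" condition.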

> OK, now I'm another tool, and want to read the SG15 format.
>
> I'm running with some filesystem encoding. I have a UTF8 filename. Can I
> convert to filesystem encoding?

No. This is the failure mode: if the file name was stored in UTF-8 and I don't
know what the source used to decode the bag of bits to Unicode, I can't be
sure to reproduce the same bag of bits.
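A small Python sketch of this failure mode, under assumed conditions: the file on disk is named with the single byte 0xE9 (as an ISO-8859-1 environment would write "é"), the writing tool decodes with that codec and stores UTF-8, and the reading tool runs with a different codec. The function names and the sample name are made up for illustration.

```python
# The bytes the OS actually uses to refer to the file.
on_disk = b"caf\xe9.cpp"


def write_interchange(raw_name: bytes, writer_codec: str) -> bytes:
    """Writer: decode with its own codec, then store the name as UTF-8."""
    return raw_name.decode(writer_codec).encode("utf-8")


def read_interchange(payload: bytes, reader_codec: str) -> bytes:
    """Reader: decode the UTF-8 payload, then encode with its own codec."""
    return payload.decode("utf-8").encode(reader_codec)


payload = write_interchange(on_disk, "iso-8859-1")

same_env = read_interchange(payload, "iso-8859-1")
other_env = read_interchange(payload, "utf-8")

print(same_env == on_disk)    # True: reader happens to share the codec
print(other_env == on_disk)   # False: reader gets b"caf\xc3\xa9.cpp"
```

Nothing in the UTF-8 payload tells the reader which of the two results names the real file.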

> Well if my filesystem is ASCII-only, then maybe not. But we can't fix
> that. The user needs to know that their filesystem restricts them to ASCII
> only. Similarly if their filesystem is some other subset of Unicode.

And as I've established, running with LC_ALL=C is a valid scenario, which
falls under "filesystem is ASCII-only" and therefore "we can't fix that".

> But if the filesystem can represent all of UTF8 (or all of UTF8 that is in
> the filename) into its encoding, and you have the current filesystem
> encoding (maybe different encoding than the UTF8 originally converted
> from), you can do UTF8 -> filesystem?

No, we can't. Even if we limit ourselves to Unicode-capable filesystems, we
have a problem.

Let's say the file is named "测试.cpp" (I used Google translate for "test", I
don't know if this is proper Chinese). That means the SG15 file contains a
payload that was U+6D4B U+8BD5 U+002E U+0063 U+0070 U+0070. When the tool
converts from UTF-8 to the filesystem encoding, I can think of two valid 8-bit
codecs capable of full Unicode. The first is UTF-8 itself, which results in bytes:
 e6 b5 8b e8 af 95 2e 63 70 70
The other is GB18030, where we have bytes:
 b2 e2 ca d4 2e 63 70 70
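The two byte sequences can be reproduced with Python's standard codecs. This is just a check of the example above; the variable names are arbitrary.

```python
# U+6D4B U+8BD5 followed by ".cpp"
name = "\u6d4b\u8bd5.cpp"

utf8_bytes = name.encode("utf-8")
gb_bytes = name.encode("gb18030")

print(utf8_bytes.hex(" "))   # e6 b5 8b e8 af 95 2e 63 70 70
print(gb_bytes.hex(" "))     # b2 e2 ca d4 2e 63 70 70

# Both decode back to the same text, yet they name different files on disk.
print(utf8_bytes.decode("utf-8") == gb_bytes.decode("gb18030"))  # True
print(utf8_bytes == gb_bytes)                                    # False
```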

> At which step(s) can things go wrong?

All of them, starting from the delineation of the problem space.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-06 23:33:03