Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 05 Sep 2019 21:18:38 -0700
On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
> To solve the OP's problem, why doesn't P1689 simply store BOTH the
> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?

That's what the paper currently proposes. My argument is that you should
choose one only.

> The UTF8-attempt edition is where one takes the raw bytes in the native
> filesystem encoding, and converts it to UTF-8. Note that even on POSIX,
> filesystem paths are not necessarily in valid UTF-8, and ought to be
> treated as raw bytes if you want to be able to reopen the original file
> after encoding into JSON.
> If the raw bytes edition of pathnames in the JSON file is present, it is
> used first during lookup. If lookup with the raw byte edition fails, or
> if it is not present in the JSON file, the UTF-8 edition is converted to
> the native filesystem encoding, and that is used.
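
For reference, the lookup order quoted above can be sketched roughly as follows. This is an illustration only, not anything P1689 specifies: the field names "raw" and "text", the hex encoding of the raw bytes, and the `exists` callback are all hypothetical, chosen to keep the example self-contained.

```python
from typing import Callable, Optional


def resolve(entry: dict,
            exists: Callable[[bytes], bool],
            fs_encoding: str = "utf-8") -> Optional[bytes]:
    """Return the native byte path to open, or None if nothing matches."""
    # Step 1: prefer the raw-bytes edition when the payload carries one.
    raw_hex = entry.get("raw")  # hypothetical field: hex of the native bytes
    if raw_hex is not None:
        raw = bytes.fromhex(raw_hex)
        if exists(raw):
            return raw
    # Step 2: fall back to the text edition, converted to the encoding
    # this particular tool assumes the filesystem uses.
    try:
        candidate = entry["text"].encode(fs_encoding)  # hypothetical field
    except UnicodeEncodeError:
        return None
    return candidate if exists(candidate) else None


# A pretend filesystem holding one Latin-1 encoded name.
on_disk = {b"/tmp/\xe9.c"}
entry = {"text": "/tmp/\u00e9.c", "raw": b"/tmp/\xe9.c".hex()}
print(resolve(entry, on_disk.__contains__))
```

Note that step 2 depends on `fs_encoding`, which is the tool's guess; nothing in the payload pins it down.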

Sorry Niall, I don't think this will work.

If the raw bytes edition is optional, then it means a valid payload can
include only the UTF-8 representation in the JSON String. But that opens the
possibility that two tools will disagree as to what file it represents. For
example:
    "file": "/tmp/é.c"

$ ls -1ib *.c
5303210 \351.c
5303209 é.c

$ LC_ALL=en_US.ISO-8859-1 ls -1ib *.c | iconv -f latin1
5303209 Ã©.c
5303210 é.c

Which of the two inodes is the JSON file referring to?
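
The ambiguity is easy to reproduce outside the shell. The following is a minimal sketch, assuming Python and not taken from the thread: it decodes one JSON string and shows that it maps to two different byte sequences, depending on which filesystem encoding the consuming tool assumes.

```python
import json

# The JSON payload carries text only; it says nothing about on-disk bytes.
payload = json.loads('{"file": "/tmp/\\u00e9.c"}')
text_path = payload["file"]

# Two tools, two assumptions about the filesystem encoding.
as_utf8 = text_path.encode("utf-8")      # matches the name of inode 5303209
as_latin1 = text_path.encode("latin-1")  # matches the name of inode 5303210

print(as_utf8)    # b'/tmp/\xc3\xa9.c'
print(as_latin1)  # b'/tmp/\xe9.c'
```

Both results are legal POSIX file names, so both files can exist side by side in the same directory.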

Using the UTF-8 encoded text is Option 1 in my proposal. I don't have a
problem with it, but if adopted, then implementers need to understand the
problems shown above in the ls outputs will happen (note how there's a second

If the raw form is mandatory, then the text form is superfluous. That covers
both variants of Option 2, which differ only in which raw forms are required.

Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-06 06:18:42