Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Fri, 6 Sep 2019 13:46:51 +0100
On 06/09/2019 05:18, Thiago Macieira wrote:
> On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
>> To solve the OP's problem, why doesn't P1689 simply store BOTH the
>> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?
>
> That's what the paper currently proposes. My argument is that you should
> choose one only.

My reading of their paper was that they want to encode non-UTF8
sequences into UTF8 paths in a JSON file. I don't think they should take
that path, because it loses too much information.

>> The UTF8-attempt edition is where one takes the raw bytes in the native
>> filesystem encoding, and converts it to UTF-8. Note that even on POSIX,
>> filesystem paths are not necessarily in valid UTF-8, and ought to be
>> treated as raw bytes if you want to be able to reopen the original file
>> after encoding into JSON.
>>
>> If the raw bytes edition of pathnames in the JSON file is present, it is
>> used first during lookup. If lookup with the raw byte edition fails, or
>> if it is not present in the JSON file, the UTF-8 edition is converted to
>> the native filesystem encoding, and that is used.
>
> Sorry Niall, I don't think this will work.
>
> If the raw bytes edition is optional, then it means a valid payload can
> include only the UTF-8 representation in the JSON String.

I was thinking of the situation where a script, or a human, manually edits
the paths in a JSON file. As they will only understand the UTF-8 form, they
would need to delete the binary path representation entirely to have it
disregarded, as they cannot usefully modify it.
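
A minimal sketch of the lookup order under discussion, assuming a
hypothetical optional "file_raw" field that carries the raw bytes as hex
(the paper may well choose a different name and escaping), next to the
"file" text field used in the example in this thread:

```python
import os

def candidate_paths(entry, fs_encoding="utf-8"):
    """Yield native byte paths to try, raw-bytes edition first."""
    raw_hex = entry.get("file_raw")  # hypothetical field name, hex is an assumption
    if raw_hex is not None:
        yield bytes.fromhex(raw_hex)
    # Fallback: convert the UTF-8 text edition to the native filesystem encoding.
    yield entry["file"].encode(fs_encoding)

def resolve(entry, fs_encoding="utf-8", exists=os.path.exists):
    """Return the first candidate that exists, or None if none does."""
    for path in candidate_paths(entry, fs_encoding):
        if exists(path):
            return path
    return None
```

Deleting "file_raw" from an entry leaves only the converted text edition as
a candidate, which is the case a hand-edited file would fall into.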

> But that opens the
> possibility that two tools will disagree as to what file it represents. For
> example:
> {
> "file": "/tmp/é.c"
> }
>
> $ ls -1ib *.c
> 5303210 \351.c
> 5303209 é.c
>
> $ LC_ALL=en_US.ISO-8859-1 ls -1ib *.c | iconv -f latin1
> 5303209 Ã©.c
> 5303210 é.c
>
> Which of the two inodes is the JSON file referring to?

Absolutely right. If you delete the binary path representation, you get
problems like this. But, equally, you have to allow third party tooling
to modify the paths in the JSON for whatever reason. The cost is exactly
the problem you describe.
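
To make the cost concrete, here is a small sketch of the ambiguity in the
quoted listing. The two byte strings are taken from the ls output (octal
\351 is 0xE9); the helper names are made up for illustration, and decoding
with the locale's encoding is only an approximation of what a real tool
would do:

```python
# Two distinct on-disk names (raw bytes) from the listing quoted above.
LATIN1_NAME = b"\xe9.c"      # shown by ls as \351.c in a UTF-8 locale
UTF8_NAME = b"\xc3\xa9.c"    # shown by ls as é.c in a UTF-8 locale

def display(name: bytes, locale_encoding: str) -> str:
    """Text a tool would derive from a raw name under a given locale encoding."""
    return name.decode(locale_encoding, errors="backslashreplace")

def matches(text: str, names, locale_encoding: str):
    """Raw names that a tool using locale_encoding would equate with text."""
    return [n for n in names if display(n, locale_encoding) == text]

names = [LATIN1_NAME, UTF8_NAME]
print(matches("é.c", names, "utf-8"))    # a UTF-8 tool selects UTF8_NAME
print(matches("é.c", names, "latin-1"))  # a Latin-1 tool selects LATIN1_NAME
```

The same JSON string therefore names a different file depending on which
encoding the consuming tool assumes.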

> Using the UTF-8 encoded text is Option 1 in my proposal. I don't have a
> problem with it, but if adopted, then implementers need to understand the
> problems shown above in the ls outputs will happen (note how there's a second
> issue).
>
> If the raw form is mandatory, then the text form is superfluous. That's both
> options 2, differing only on what raw forms are required.

The text form is to handle different native filesystem encodings. A
platform is permitted to have one native filesystem encoding in one
program, and a different native filesystem encoding in another program.
For example, ANSI vs UNICODE Windows programs. Both programs may work on
the same JSON file. They need some common mechanism to communicate if
they use dissimilar native filesystem encodings, and a UTF8-attempt as a
fallback is as good as any.
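
As a rough illustration of that fallback, assuming code page 1252 for the
ANSI program (the actual code page depends on the system) and UTF-16LE for
the UNICODE one, with to_native being a hypothetical helper:

```python
def to_native(text: str, native_encoding: str) -> bytes:
    """Convert the shared UTF-8 text form to one program's native encoding."""
    return text.encode(native_encoding)

shared = "é.c"  # the text form both programs read from the JSON file

ansi_bytes = to_native(shared, "cp1252")     # ANSI program, code page assumed
wide_bytes = to_native(shared, "utf-16-le")  # UNICODE program

# The native bytes differ, but both decode back to the same shared text.
assert ansi_bytes != wide_bytes
assert ansi_bytes.decode("cp1252") == wide_bytes.decode("utf-16-le") == shared
```

A name with characters outside the ANSI code page would make the first
conversion raise UnicodeEncodeError, so the fallback only works where the
text form is representable in both encodings.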

Niall

Received on 2019-09-06 14:46:54