Date: Fri, 6 Sep 2019 09:38:45 -0400
On 9/6/19 8:46 AM, Niall Douglas wrote:
> On 06/09/2019 05:18, Thiago Macieira wrote:
>> On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
>>> To solve the OP's problem, why doesn't P1689 simply store BOTH the
>>> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?
>>
>> That's what the paper currently proposes. My argument is that you should
>> choose one only.
>
> My reading of their paper was that they want to encode non-UTF8
> sequences into UTF8 paths in a JSON file. I don't think they should take
> that path, because it loses too much information.
We'll have to clarify the wording. We propose two allowed representations:

- An array of integers tagged with the corresponding size of values in memory.
  This can represent an arbitrary binary sequence and is the general form.
  This variant also allows a "readable-name" field intended only for human
  consumption that is not meant for use in accessing the filesystem.
  It is optional and superfluous for tooling but useful for debugging.

- UTF-8. This is allowed *only if a lossless round trip* is possible
  between the filesystem's native binary sequence and UTF-8. E.g. on
  Windows we should not have to require the full general format to represent
  a simple path like "a.cxx" just because the filesystem APIs use wide chars.
  This is intended for the common use case of ASCII-only file paths to make
  the format simpler and more human readable (e.g. for debugging). We then
  generalize beyond ASCII to allow any lossless UTF-8 round-trip (implying
  that the locale does not change).
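
As a rough illustration of the two representations above, here is a minimal
Python sketch of how a tool might choose between them. Only "readable-name"
comes from the description above; the function name `encode_path` and the
"format" and "code-units" keys are assumptions made up for this example and
are not a statement of what the paper specifies. The sketch also assumes the
native sequence is either 8-bit units (POSIX-style) or 16-bit units
(Windows-style).

```python
import json


def encode_path(units, unit_size):
    """Pick a representation for a native path.

    units     -- list of ints, the filesystem's native code units
    unit_size -- size of each unit in bytes (1 or 2 in this sketch)

    Returns a plain str when a lossless round trip is possible, otherwise a
    dict holding the raw code units (key names are hypothetical).
    """
    codec = {1: "utf-8", 2: "utf-16-le"}[unit_size]
    raw = b"".join(u.to_bytes(unit_size, "little") for u in units)

    try:
        # Strict decoding rejects invalid bytes and lone surrogates.
        text = raw.decode(codec)
    except UnicodeDecodeError:
        text = None

    # Simple form: allowed only if encoding back reproduces the exact units.
    if text is not None and text.encode(codec) == raw:
        return text

    # General form: arbitrary binary sequence plus a debugging-only name.
    return {
        "format": f"raw{unit_size * 8}",   # hypothetical size tag
        "code-units": list(units),         # hypothetical key name
        "readable-name": raw.decode(codec, errors="replace"),
    }


if __name__ == "__main__":
    # "a.cxx" as 16-bit units round-trips, so the simple string form is used.
    print(json.dumps(encode_path([0x61, 0x2E, 0x63, 0x78, 0x78], 2)))
    # 0xFF is never valid in UTF-8, so the general form is used.
    print(json.dumps(encode_path([0x61, 0xFF], 1)))
```

A consumer would use the string or the code units to access the filesystem
and would never use "readable-name" for that purpose.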
-Brad
Received on 2019-09-06 15:38:48