Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Brad King <brad.king_at_[hidden]>
Date: Fri, 6 Sep 2019 09:38:45 -0400

On 9/6/19 8:46 AM, Niall Douglas wrote:
> On 06/09/2019 05:18, Thiago Macieira wrote:
>> On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
>>> To solve the OP's problem, why doesn't P1689 simply store BOTH the
>>> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?
>>
>> That's what the paper currently proposes. My argument is that you should
>> choose one only.
>
> My reading of their paper was that they want to encode non-UTF8
> sequences into UTF8 paths in a JSON file. I don't think they should take
> that path, because it loses too much information.

We'll have to clarify the wording. We propose two allowed representations:

- An array of integers tagged with the size each value occupies in memory.
  This can represent an arbitrary binary sequence and is the general form.

  This variant also allows a "readable-name" field intended only for human
  consumption that is not meant for use in accessing the filesystem.
  It is optional and superfluous for tooling but useful for debugging.

- UTF-8. This is allowed *only if a lossless round trip* is possible
  between the filesystem's native binary sequence and UTF-8. E.g. on
  Windows we should not have to require the full general format to represent
  a simple path like "a.cxx" just because the filesystem APIs use wide chars.

  This is intended for the common use case of ASCII-only file paths to make
  the format simpler and more human readable (e.g. for debugging). We then
  generalize beyond ASCII to allow any lossless UTF-8 round-trip (implying
  that the locale does not change).
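
As an illustration of the two representations above, here is a minimal Python sketch of how a producer might choose between them. It is not taken from the paper: the function name `encode_path` and the JSON keys `format`, `unit-size`, `units` and `readable-name` are hypothetical, and it assumes that 8-bit native paths are interpreted as UTF-8 and 16-bit native paths as UTF-16. A real tool would use whatever encoding the filesystem and locale actually imply.

```python
import json

def encode_path(units, unit_size):
    """Pick a representation for a native path.

    units     -- list of ints, the native code units of the path
    unit_size -- 1 for byte paths (assumed UTF-8), 2 for wide-char paths
                 (assumed UTF-16); both encodings are assumptions of this sketch
    Returns a str when a lossless UTF-8 round trip is possible, otherwise a
    dict holding the raw code units (hypothetical field names).
    """
    if unit_size == 1:
        codec = "utf-8"
    elif unit_size == 2:
        codec = "utf-16-le"
    else:
        raise ValueError("unsupported unit size")

    # Serialize the code units to bytes so the codec machinery can be used.
    raw = b"".join(u.to_bytes(unit_size, "little") for u in units)

    try:
        # Strict decoding rejects invalid UTF-8 and unpaired surrogates.
        text = raw.decode(codec)
        lossless = text.encode(codec) == raw
    except UnicodeDecodeError:
        lossless = False

    if lossless:
        # Simple form: the JSON string itself is the path.
        return text

    # General form: raw code units plus a name for human readers only.
    return {
        "format": "raw",
        "unit-size": unit_size,
        "units": list(units),
        "readable-name": raw.decode(codec, errors="replace"),
    }

if __name__ == "__main__":
    print(json.dumps(encode_path(list(b"a.cxx"), 1)))
    print(json.dumps(encode_path([0x61, 0xFF], 1)))
    print(json.dumps(encode_path([0x61, 0xD800], 2)))
```

In this sketch a path such as "a.cxx" comes out as a plain string whether the native form is bytes or wide characters, while a byte path containing 0xFF, or a wide path containing an unpaired surrogate, falls back to the integer-array form. The "readable-name" value uses replacement characters and must not be used to access the filesystem.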

-Brad

Received on 2019-09-06 15:38:48