sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 5 Sep 2019 11:51:41 +0100

Firstly, NUL is a valid filesystem path codepoint on some platforms, and
I'd like to get the standard fixed on that incorrectness in the near
future. I think that we can reasonably declare the native path separator
codepoint the only invalid filesystem path codepoint, as otherwise
filesystem::path doesn't work.

Secondly, as I've often told you Thiago, the native Windows filesystem
API is also byte based. struct UNICODE_STRING takes a *byte* length, not a
wchar_t length. I'll agree that the Win32 path translation layer
complicates that, but underneath it's all byte based, and I would like
to hope that whatever modern i/o proposal WG21 chooses will expose
reality on Windows.

To solve the OP's problem, why doesn't P1689 simply store BOTH the
UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?

The UTF8-attempt edition is where one takes the raw bytes in the native
filesystem encoding, and converts it to UTF-8. Note that even on POSIX,
filesystem paths are not necessarily in valid UTF-8, and ought to be
treated as raw bytes if you want to be able to reopen the original file
after encoding into JSON.

If the raw bytes edition of pathnames in the JSON file is present, it is
used first during lookup. If lookup with the raw byte edition fails, or
if it is not present in the JSON file, the UTF-8 edition is converted to
the native filesystem encoding, and that is used.
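
A minimal sketch of that lookup order, assuming C++17 std::filesystem. The
PathEntry record, resolve() and utf8_to_native() are hypothetical names, not
anything P1689 defines, and the conversion step is a pass-through stub that
assumes the narrow native encoding is UTF-8:

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical record for one pathname entry read from the JSON file.
struct PathEntry {
    std::optional<std::string> raw_bytes;  // native-encoding bytes, if stored
    std::string utf8;                      // the UTF8-attempt edition
};

// Assumption: the narrow native encoding is UTF-8, so this is a pass-through.
// A real tool would convert to the current native filesystem encoding here.
inline std::string utf8_to_native(const std::string& utf8) { return utf8; }

// Try the raw bytes edition first; fall back to the UTF-8 edition.
inline std::optional<std::filesystem::path> resolve(const PathEntry& e) {
    std::error_code ec;  // swallow lookup errors and treat them as "not found"
    if (e.raw_bytes) {
        std::filesystem::path p(*e.raw_bytes);
        if (std::filesystem::exists(p, ec)) return p;
    }
    std::filesystem::path p(utf8_to_native(e.utf8));
    if (std::filesystem::exists(p, ec)) return p;
    return std::nullopt;
}
```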

This seems to me a reasonable balance of competing factors, and solves
P1689's problem, whilst also keeping the JSON file both human readable
and easily printable by tooling. It also makes a reasonable attempt at
handling the native filesystem encoding changing between uses of the
JSON file, which is a legal possibility.

Niall

On 05/09/2019 05:12, Thiago Macieira wrote:
> Hello Ben, Brad
>
> SG16 was asked to comment on P1689 and how file names can be encoded in a file
> format for exchange of information between different tools in a buildsystem.
> This is not SG16's official reply, it is my own opinion that other members of
> SG16 asked me to write after a discussion in our Slack channel.
>
> This is outside of the current scope of the C++ standard, since the standard
> only admits that file names are a sequence of narrow characters not containing
> a NUL, but makes absolutely no determination of what those characters mean. If
> that were sufficient, you wouldn't have asked for an opinion, since all you'd
> need to do would be to encode those bytes somehow. We know that the standard
> is deficient in this area. Notably, file names on Windows are actually
> stored on the file system and accessed in the low-level API using 16-bit
> wchar_t, instead of the 8-bit char of that platform.
>
> Before I begin, let me say that this paper came as a surprise to those of us
> who are not familiar with SG15's workings. The paper describes a format, but
> does not explain what that format is for. Please revise this paper and try to
> answer some of these questions:
>
> * what tools produce the file?
> * what tools consume the file?
> * how long is the file supposed to last? Is the file supposed to be committed
> to version control?
> * how far is the file supposed to be spread? That is, is networking in scope?
> * what problem does this solve? Is it a new problem?
> * what happens if we don't add this file?
> * what other alternatives were considered? Both as solution and as file
> formats.
> * what happens if the multiple tools do not agree on the view of the
> filesystem (different root, different mountpoints, etc.)? How do you deal with
> this?
>
> == Assumptions ==
>
> Since the paper does not answer those questions above, I am making the
> following assumptions:
>
> 1) the file is an artifact of the build that is not meant to be committed to
> version control. Notably, this means that two builds of the same software are
> not supposed to share this file.
>
> 2) networking is not in scope. Distributed builds are considered an extension
> of the local system, so they don't count as networking. Distributed build
> tools need to emulate the file system of the originator.
>
> 3) different views of the filesystem are out of scope.
>
> == File paths for interchange ==
>
> I propose three options and I will let you choose which one you want. This
> section is only about the file paths ("payload") and is independent of the
> format of the file ("transport"), but I will make references to storing such
> payload in JSON.
>
> The options are:
> - Option 1: file names are Unicode text
> - Option 2: file names are binary
> * 2a: file names are bytes only
> * 2b: file names can be bytes or words
>
> === Option 1: file names are Unicode text ===
> a.k.a. "What could go wrong™?" option
>
> File names and paths are a valid sequence of Unicode codepoints. This is true
> because a file is very often displayed to the user in a shell, command-prompt,
> graphical or text interface, etc. When that happens, file names *are* text.
> This option is what people *expect* to happen and is therefore the natural
> solution.
>
> In JSON, this means file names are transmitted as Strings (RFC 8259 section
> 7), encoded in UTF-8. In that scenario, you'd open a file name found in the
> payload the following ways:
>
> a) on Windows, use c8strtowcs or c8srtoc16s and pass the result to _wfopen()
> or CreateFileW
>
> b) on other systems, use SG16's proposed c8srtombs ("char8_t string to
> multibyte string") and pass the result to open() or fopen()
>
> c) with Qt, if using QJsonDocument, then pass the string from
> QJsonValue::toString() to QFile.
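
Cases (a) and (b) could look roughly like this. The c8srto* functions are
only proposed, so this sketch swaps in the existing Win32 MultiByteToWideChar
for the Windows branch, and on other systems it assumes the locale codec is
UTF-8 and passes the bytes through. open_utf8_name() is a hypothetical helper
name:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

// Open a file whose name arrived as UTF-8 text (option 1).
// Returns nullptr if the name cannot be converted or the file cannot be opened.
inline std::FILE* open_utf8_name(const std::string& utf8, const char* mode) {
#ifdef _WIN32
    // Strict conversion: MB_ERR_INVALID_CHARS rejects ill-formed UTF-8.
    const int len = static_cast<int>(utf8.size());
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), len, nullptr, 0);
    if (n <= 0) return nullptr;
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), len, wide.data(), n);
    // The mode string is plain ASCII, so widening each char is enough.
    const std::wstring wmode(mode, mode + std::strlen(mode));
    return _wfopen(wide.c_str(), wmode.c_str());
#else
    // Assumption: the locale codec is UTF-8, so the payload is passed through.
    return std::fopen(utf8.c_str(), mode);
#endif
}
```
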
>
> Consequences:
> 1) easiest implementation. Codecs between UTF-8, UTF-16 and the narrow- and
> wide-character strings are everywhere.
>
> 2) on modern Unix systems, the locale codec is UTF-8, which means the
> implementation is even simpler. Tools can be designed to only support this
> environment and therefore perform a pass-through from UTF-8 payload directly
> to the filesystem and vice-versa.
>
> 3) only file names that can be decoded into the Unicode string are
> permissible. On Unix, anything that mbsrtoc8s fails to decode is
> unrepresentable and therefore should be considered filesystem corruption.
> Similarly for Windows: file names with improperly-paired surrogate code units
> are unrepresentable and therefore filesystem corruption.
>
> 4) changes in the encoding for the narrow- and/or wide-character sets are a
> failure mode and not supported. Notably, changing LANG or LC_ALL on Unix
> systems. This includes setting LC_ALL to "C", something a lot of tools do when
> they parse output from other tools, to ensure the output format they're
> parsing is stable.
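
For consequence 3 on Windows, a tool producing the file would need some way
to detect names that option 1 cannot represent. A minimal sketch of such a
check, with is_well_formed_utf16() being a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if every surrogate code unit is part of a high/low pair, i.e. the
// name is well-formed UTF-16 and can be carried as Unicode text.
inline bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 == s.size()) return false;     // nothing follows it
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                     // skip the low half
        } else if (u >= 0xDC00 && u <= 0xDFFF) {     // stray low surrogate
            return false;
        }
    }
    return true;
}
```
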
>
> === Option 2a: file names are bytes ===
> a.k.a. "Windows developers feel the pain" option
>
> For systems where the filesystem API is implemented using narrow characters
> (that is, bytes), the payload is the exact array of bytes that the API
> provided and accepts. For systems where the API is not using narrow
> characters, a lossless transformation to bytes is required. Transporting those
> bytes in JSON is done by either using Base64 in a JSON String or by using an
> array of JSON numbers.
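
The array-of-numbers form is simple enough to sketch without a JSON library.
bytes_to_json_array() is a hypothetical helper, and the output is only the
array value, not a complete P1689 document:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Render a raw file name as a JSON array of byte values (0..255).
inline std::string bytes_to_json_array(const std::string& raw) {
    std::string out = "[";
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (i != 0) out += ',';
        // Go through unsigned char so bytes >= 0x80 are not printed negative.
        out += std::to_string(
            static_cast<unsigned>(static_cast<unsigned char>(raw[i])));
    }
    out += ']';
    return out;
}
```
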
>
> The only system I know where the native filesystem API is not byte-based is
> Windows. So for Windows, the file names are transformed using CESU-8 / WTF-8,
> *not* UTF-8. That is, any surrogate code units found in the file name are
> stored as the 3-byte UTF-8 encoding of each, not the 4-byte encoding of the
> UTF-32 code point they're supposed to represent.
>
> This solution is lossless and can represent all possible file names.
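
A minimal sketch of the transformation as described above, where every
16-bit unit is encoded on its own (so a surrogate always becomes 3 bytes and
the 4-byte form never appears). units_to_bytes() and bytes_to_units() are
hypothetical names, and this is not a vetted CESU-8 implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Encode each UTF-16 code unit independently into 1 to 3 bytes.
inline std::string units_to_bytes(std::u16string_view units) {
    std::string out;
    for (char16_t u : units) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {  // includes lone or paired surrogates, 3 bytes each
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}

// Inverse of units_to_bytes(); returns nullopt on input it did not produce.
inline std::optional<std::u16string> bytes_to_units(std::string_view bytes) {
    // Payload of the continuation byte at index k, or -1 if there is none.
    auto cont = [&](std::size_t k) -> int {
        if (k >= bytes.size()) return -1;
        const unsigned char b = static_cast<unsigned char>(bytes[k]);
        return (b & 0xC0) == 0x80 ? (b & 0x3F) : -1;
    };
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out += static_cast<char16_t>(b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0) {
            const int c1 = cont(i + 1);
            if (c1 < 0) return std::nullopt;
            const int u = ((b0 & 0x1F) << 6) | c1;
            if (u < 0x80) return std::nullopt;   // overlong form
            out += static_cast<char16_t>(u);
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0) {
            const int c1 = cont(i + 1), c2 = cont(i + 2);
            if (c1 < 0 || c2 < 0) return std::nullopt;
            const int u = ((b0 & 0x0F) << 12) | (c1 << 6) | c2;
            if (u < 0x800) return std::nullopt;  // overlong form
            out += static_cast<char16_t>(u);
            i += 3;
        } else {
            return std::nullopt;  // 4-byte UTF-8 is not part of this scheme
        }
    }
    return out;
}
```
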
>
> To open such a file, you'd do:
>
> a) on Windows, convert the byte array from CESU-8 / WTF-8 to WTF-16
> ("potentially ill-formed UTF-16"), then pass the file name to _wfopen() or
> CreateFileW()
>
> b) on other systems, pass the byte array directly to open() or fopen()
>
> c) with Qt, convert the byte array from CESU-8 / WTF-8 to WTF-16 and pass the
> resulting QString to QFile
>
> Consequences:
> 1) easiest for Unix, since it's pass-through. However, for Windows and other
> UTF-16-using APIs, there's a non-trivial hurdle. The implementation for CESU-8
> encoding and decoding is *not* provided in the standard library and is not
> usually found in Unicode libraries. In fact, using compliant UTF-8 encoders
> and decoders is *not* permitted in this solution.
>
> === Option 2b: file names are bytes or words ===
> a.k.a. "spread the pain" option
>
> This is an extension of option 2a. It admits that file names on Windows are
> actually composed of 16-bit units and permits those as the payload. So the
> file names are stored in the payload with a tag indicating whether the
> contents are 8-bit or 16-bit.
>
> Native Windows tools therefore can perform pass-through, if the payload is
> stored 16-bit. The problem is that both 8- and 16-bit are allowed, which means
> all tools need to deal with both possibilities.
>
> I) if the payload is stored as 8-bit, do as option 2a
> II) if the payload is stored as 16-bit, then:
> a) on Windows and with Qt, pass-through
>
> b) on other systems, assume it's WTF-16 and encode as CESU-8, then pass to
> open() or fopen()
>
> The rationale for Unix systems also dealing with 16-bit units is because of
> Cygwin and WSL. See analysis below.
>
> == Windows Analysis ==
> I can think of four relevant build environments for Windows, which form two
> distinct groups today, plus a theoretical third that currently does not exist:
>
> 1) native applications built with MSVC (ucrt.dll); _WIN32 is defined
> 2) native applications built with MinGW (crtdll.dll); _WIN32 is defined
> 3) Unix applications built with Cygwin / MSYS2; _WIN32 is not defined
> 4) Unix applications built for Linux, run under WSL; _WIN32 is not defined
>
> It is conceivable that these four types of applications are all mixed together
> in a single build, so they could be sharing the same data that P1689 is meant
> to share. And CMake is the prime example of this: it can be any of the four,
> driving a make and a compiler that is any of the four too.
>
> The three groups are:
>
> a) Wide API available and narrow is ANSI (1 and 2 above)
> b) Wide API is available and narrow is UTF-8 (theoretical)
> c) no Wide API, narrow is UTF-8 (3 and 4)
>
> Group c only has open() and fopen() available. Fortunately, the Cygwin/MSYS2
> runtime takes the narrow character input and converts it to wchar_t using
> UTF-8 (I don't know whether it's CESU-8), so those applications just work.
> For them, option 2a is a pass-through; option 2b requires the UTF-16 to UTF-8
> codec, then pass-through; and option 1 admits the pass-through solution with
> an #ifdef.
>
> Group b has both APIs available. For this group, pass through is available in
> both options 2a and 2b and can take the shortcut on option 1.
>
> For both groups b and c, Unix applications can be rebuilt on Windows with
> little to no porting.
>
> Group a MUST NOT use _open() and fopen(). No exceptions. This means Unix
> applications must be ported to Windows in order to operate properly if
> compiled with those compilers, so that they will use _wfopen() or
> CreateFileW(). For those, pass-through is only possible under option 2b, if
> the payload is 16-bit.
>
> == Transport ==
> P1689 suggests using JSON. I'm comparing that in the context of the three
> options with a binary format (CBOR).
>
> One thing SG16 is completely in agreement on is that if you go with JSON, you
> must obey RFC 8259: there must not be a BOM and the file must be encoded in
> UTF-8.
>
> Option 1) Since file names are text, JSON is actually well-placed and the file
> names are stored as JSON Strings. This is easy to debug in any UTF-8 capable
> text editor, though of course one that understands JSON is recommended. Most
> JSON APIs provide strings directly in UTF-8, so that content can be passed to
> the UTF-8 to locale encoder / decoder. CBOR also stores text strings as UTF-8,
> so the same ease of encoding and decoding to the locale is there.
>
> Option 2a) File names are binary data, so they MUST NOT be stored as-is in
> JSON strings. I recommend either base64 in a string or an array of numbers.
> For this, a binary solution is better: CBOR has a type called "byte string",
> which can store binary data.
>
> Option 2b) is an extension of 2a. You store the payload the same way, except
> that you must also store a tag indicating whether the data was 8 or 16-bit. If
> using Base64, it must also indicate whether it's big-endian or little (this
> problem does not exist for an array of numbers). The same constraints apply to
> CBOR and I do not recommend storing as an array of numbers as that will double
> the space necessary to store compared to a byte string and will be slower to
> encode and decode.
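
To illustrate why the endianness tag matters for a 16-bit payload carried as
bytes, here is a sketch of reassembling the units after Base64 decoding. The
ByteOrder tag and units_from_bytes() are hypothetical; nothing here is a
format P1689 specifies:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

enum class ByteOrder { Little, Big };  // hypothetical tag stored with the payload

// Rebuild 16-bit units from raw payload bytes using the stored byte order.
// Returns nullopt if the byte count is odd.
inline std::optional<std::u16string> units_from_bytes(std::string_view bytes,
                                                      ByteOrder order) {
    if (bytes.size() % 2 != 0) return std::nullopt;
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const std::size_t lo_at = (order == ByteOrder::Little) ? i : i + 1;
        const std::size_t hi_at = (order == ByteOrder::Little) ? i + 1 : i;
        const unsigned lo = static_cast<unsigned char>(bytes[lo_at]);
        const unsigned hi = static_cast<unsigned char>(bytes[hi_at]);
        out += static_cast<char16_t>((hi << 8) | lo);
    }
    return out;
}
```

Reading the same four bytes with the wrong tag yields different units, which
is the failure the tag is there to prevent.
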
>
> This is it. I know this is a long email, but hopefully it helps you come to
> some conclusions.
>
>

Received on 2019-09-05 12:51:45