sg16: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 04 Sep 2019 21:12:53 -0700

Hello Ben, Brad

SG16 was asked to comment on P1689 and how file names can be encoded in a file
format for exchange of information between different tools in a buildsystem.
This is not SG16's official reply; it is my own opinion, which other members of
SG16 asked me to write up after a discussion in our Slack channel.

This is outside of the current scope of the C++ standard, since the standard
only admits that file names are a sequence of narrow characters not containing
a NUL, but makes absolutely no determination of what those characters mean. If
that were sufficient, you wouldn't have asked for an opinion, since all you'd
need to do would be to encode those bytes somehow. We know that the standard
is deficient in this area. Notably, file names on Windows are actually stored
on the file system and accessed in the low-level API using 16-bit wchar_t,
instead of the 8-bit char of that platform.

Before I begin, let me say that this paper came as a surprise to those of us
who are not familiar with SG15's workings. The paper describes a format, but
does not explain what that format is for. Please revise this paper and try to
answer some of these questions:

* what tools produce the file?
* what tools consume the file?
* how long is the file supposed to last? Is the file supposed to be committed
   to version control?
* how far is the file supposed to be spread? That is, is networking in scope?
* what problem does this solve? Is it a new problem?
* what happens if we don't add this file?
* what other alternatives were considered, both as solutions and as file
   formats?
* what happens if multiple tools do not agree on the view of the filesystem
   (different root, different mountpoints, etc.)? How do you deal with this?

== Assumptions ==

Since the paper does not answer those questions above, I am making the
following assumptions:

1) the file is an artifact of the build that is not meant to be committed to
version control. Notably, this means that two builds of the same software are
not supposed to share this file.

2) networking is not in scope. Distributed builds are considered an extension
of the local system, so they don't count as networking. Distributed build
tools need to emulate the file system of the originator.

3) different views of the filesystem are out of scope.

== File paths for interchange ==

I propose three options and I will let you choose which one you want. This
section is only about the file paths ("payload") and is independent of the
format of the file ("transport"), but I will make references to storing such
payload in JSON.

The options are:
- Option 1: file names are Unicode text
- Option 2: file names are binary
    * 2a: file names are bytes only
    * 2b: file names can be bytes or words

=== Option 1: file names are Unicode text ===
a.k.a. "What could go wrong™?" option

File names and paths are a valid sequence of Unicode codepoints. This is true
because a file name is very often displayed to the user in a shell,
command-prompt, graphical or text interface, etc. When that happens, file
names *are* text. This option is what people *expect* to happen and is
therefore the natural solution.

In JSON, this means file names are transmitted as Strings (RFC 8259 section
7), encoded in UTF-8. In that scenario, you'd open a file name found in the
payload the following ways:

a) on Windows, use c8srtowcs or c8srtoc16s and pass the result to _wfopen()
or CreateFileW()

b) on other systems, use SG16's proposed c8srtombs ("char8_t string to
multibyte string") and pass the result to open() or fopen()

c) with Qt, if using QJsonDocument, then pass the string from
QJsonValue::toString() to QFile.

Consequences:
1) easiest implementation. Codecs between UTF-8, UTF-16 and the narrow- and
wide-character strings are everywhere.

2) on modern Unix systems, the locale codec is UTF-8, which means the
implementation is even simpler. Tools can be designed to only support this
environment and therefore perform a pass-through from UTF-8 payload directly
to the filesystem and vice-versa.

3) only file names that can be decoded into a Unicode string are permissible.
On Unix, anything that mbsrtoc8s fails to decode is unrepresentable and
therefore should be considered filesystem corruption. Similarly for Windows:
file names with improperly-paired surrogate code units are unrepresentable and
therefore filesystem corruption.

4) changes in the encoding for the narrow- and/or wide-character sets are a
failure mode and not supported. Notably, changing LANG or LC_ALL on Unix
systems. This includes setting LC_ALL to "C", something a lot of tools do when
they parse output from other tools, to ensure the output format they're
parsing is stable.
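
To illustrate consequences 3 and 4, here is a minimal sketch of Option 1 in
Python (chosen only for brevity). The function names and the "files" key are
hypothetical and are not taken from P1689. The locale codec is passed in
explicitly, and "ascii" stands in, as an approximation, for a process running
with LC_ALL=C.

```python
import json


def name_to_text(raw, locale_codec="utf-8"):
    """Strictly decode a native narrow file name into Unicode text.

    Raises UnicodeDecodeError for names that Option 1 cannot represent.
    """
    return raw.decode(locale_codec, errors="strict")


def to_json_payload(names, locale_codec="utf-8"):
    """Store file names as JSON Strings in a UTF-8 encoded document."""
    text_names = [name_to_text(n, locale_codec) for n in names]
    return json.dumps({"files": text_names}, ensure_ascii=False).encode("utf-8")


def representable(raw, locale_codec="utf-8"):
    """Report whether a native name survives the strict decode."""
    try:
        name_to_text(raw, locale_codec)
        return True
    except UnicodeDecodeError:
        return False


good = "na\u00efve.cpp".encode("utf-8")  # valid UTF-8 on disk
bad = b"caf\xe9.cpp"                     # Latin-1 bytes, not valid UTF-8

payload = to_json_payload([good])
# On a UTF-8 locale the consumer side is a pass-through back to bytes.
roundtrip = json.loads(payload.decode("utf-8"))["files"][0].encode("utf-8")
```

With these inputs, `good` round-trips unchanged, `bad` is rejected under a
UTF-8 locale, and `good` itself is rejected once the codec is switched to
"ascii", which is the failure mode described in consequence 4.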

=== Option 2a: file names are bytes ===
a.k.a. "Windows developers feel the pain" option

For systems where the filesystem API is implemented using narrow characters
(that is, bytes), the payload is the exact array of bytes that the API
provided and accepts. For systems where the API is not using narrow
characters, a lossless transformation to bytes is required. Transporting those
bytes in JSON is done by either using Base64 in a JSON String or by using an
array of JSON numbers.

The only system I know where the native filesystem API is not byte-based is
Windows. So for Windows, the file names are transformed using CESU-8 / WTF-8,
*not* UTF-8. That is, any surrogate code units found in the file name are
stored as the 3-byte UTF-8 encoding of each, not the 4-byte encoding of the
UTF-32 code point they're supposed to represent.

This solution is lossless and can represent all possible file names.
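
As a rough sketch of the transformation described above (in Python, for
brevity), each 16-bit unit is encoded on its own, so a surrogate always becomes
3 bytes. The function names are hypothetical. Rejecting 4-byte sequences when
decoding is my assumption about how a strict decoder for this scheme would
behave, not something stated above.

```python
def units_to_bytes(units):
    """Encode 16-bit code units to bytes, one unit at a time.

    Surrogates are not combined into a code point first, so each one is
    written as its own 3-byte sequence.
    """
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")


def bytes_to_units(data):
    """Decode bytes produced by units_to_bytes back to 16-bit code units."""
    text = data.decode("utf-8", "surrogatepass")
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Assumption: a 4-byte sequence never appears in this encoding.
            raise ValueError("4-byte UTF-8 sequence is not allowed here")
        units.append(cp)
    return units


# "a", a properly paired surrogate pair (U+1F600), then a lone surrogate.
units = [0x0061, 0xD83D, 0xDE00, 0xD800]
encoded = units_to_bytes(units)
```

Here `encoded` is 10 bytes long (1 + 3 + 3 + 3), it decodes back to the same
four units, and a compliant UTF-8 decoder refuses it.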

To open such a file, you'd do:

a) on Windows, convert the byte array from CESU-8 / WTF-8 to WTF-16
("potentially ill-formed UTF-16"), then pass the file name to _wfopen() or
CreateFileW()

b) on other systems, pass the byte array directly to open() or fopen()

c) with Qt, convert the byte array from CESU-8 / WTF-8 to WTF-16 and pass the
resulting QString to QFile

Consequences:
1) easiest for Unix, since it's pass-through. However, for Windows and other
UTF-16-using APIs, there's a non-trivial hurdle. The implementation for CESU-8
encoding and decoding is *not* provided in the standard library and is not
usually found in Unicode libraries. In fact, using compliant UTF-8 encoders
and decoders is *not* permitted in this solution.

=== Option 2b: file names are bytes or words ===
a.k.a. "spread the pain" option

This is an extension of option 2a. It admits that file names on Windows are
actually composed of 16-bit units and permits those as the payload. So the
file names are stored in the payload with a tag indicating whether the
contents are 8-bit or 16-bit.

Native Windows tools therefore can perform pass-through, if the payload is
stored 16-bit. The problem is that both 8- and 16-bit are allowed, which means
all tools need to deal with both possibilities.

I) if the payload is stored as 8-bit, do as option 2a
II) if the payload is stored as 16-bit, then:
a) on Windows and with Qt, pass-through

b) on other systems, assume it's WTF-16 and encode as CESU-8, then pass to
open() or fopen()

The rationale for Unix systems also dealing with 16-bit units is Cygwin and
WSL. See the analysis below.
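
A minimal sketch of what a consumer on a byte-oriented system would do under
option 2b, in Python for brevity. The function name and the idea of receiving
the tag as an integer width plus a list of numbers are assumptions made for
illustration only.

```python
def native_bytes_for_open(width, units):
    """Return the byte string to hand to open() on a byte-based system.

    width is the tag stored with the payload (8 or 16); units is the
    sequence of 8-bit or 16-bit values that make up the file name.
    """
    if width == 8:
        # Case I: pass-through, exactly as in option 2a.
        return bytes(units)
    if width == 16:
        # Case II b: treat the units as potentially ill-formed UTF-16 and
        # encode each unit on its own, so surrogates stay 3 bytes each.
        for u in units:
            if not 0 <= u <= 0xFFFF:
                raise ValueError("not a 16-bit code unit: %r" % (u,))
        return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")
    raise ValueError("unknown width tag: %r" % (width,))
```

Note that the same list of numbers yields different bytes depending on the
tag: `[0x61, 0xE9]` is `b"a\xe9"` when tagged 8-bit and `b"a\xc3\xa9"` when
tagged 16-bit, which is why every tool has to honor the tag.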

== Windows Analysis ==
I can think of four relevant build environments for Windows, which form two
distinct groups today, plus a theoretical third that currently does not exist:

1) native applications built with MSVC (ucrtbase.dll); _WIN32 is defined
2) native applications built with MinGW (msvcrt.dll); _WIN32 is defined
3) Unix applications built with Cygwin / MSYS2; _WIN32 is not defined
4) Unix applications built for Linux, run under WSL; _WIN32 is not defined

It is conceivable that these four types of applications are all mixed together
in a single build, so they could be sharing the same data that P1689 is meant
to share. And CMake is the prime example of this: it can be any of the four,
driving a make and a compiler that is any of the four too.

The three groups are:

a) Wide API available and narrow is ANSI (1 and 2 above)
b) Wide API is available and narrow is UTF-8 (theoretical)
c) no Wide API, narrow is UTF-8 (3 and 4)

Group c only has open() and fopen() available. Fortunately, the Cygwin/MSYS2
runtime takes the narrow character input and converts it to wchar_t using
UTF-8 (I don't know whether it's CESU-8), so those applications just work. For
them, option 2a is a pass-through; option 2b requires the UTF-16 to UTF-8
codec, then a pass-through; and option 1 admits the pass-through solution with
an #ifdef.

Group b has both APIs available. For this group, pass through is available in
both options 2a and 2b and can take the shortcut on option 1.

For both groups b and c, Unix applications can be rebuilt on Windows with
little to no porting.

Group a MUST NOT use _open() or fopen(). No exceptions. This means Unix
applications must be ported to Windows in order to operate properly if
compiled with those compilers, so that they will use _wfopen() or
CreateFileW(). For those, pass-through is only possible under option 2b, if
the payload is 16-bit.

== Transport ==
P1689 suggests using JSON. Here I compare it, in the context of the three
options, with a binary format (CBOR).

One thing SG16 is in complete agreement on is that if you go with JSON, you
must obey RFC 8259: there must not be a BOM and the file must be encoded in
UTF-8.
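
As an illustration of that requirement, here is a small Python sketch of a
loader that refuses a BOM and anything that is not UTF-8. The function name is
hypothetical. Python's own json.loads() is more lenient when given bytes, which
is why the checks are explicit here.

```python
import codecs
import json


def load_strict_json(data):
    """Parse a JSON document that must be BOM-less UTF-8.

    Raises ValueError (or a subclass) if the input has a UTF-8 BOM, is not
    valid UTF-8, or is not valid JSON.
    """
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("leading UTF-8 BOM is not allowed")
    return json.loads(data.decode("utf-8", errors="strict"))
```

A producer can use the same function as a self-check on the bytes it is about
to write.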

Option 1) Since file names are text, JSON is actually well-placed and the file
names are stored as JSON Strings. This is easy to debug in any UTF-8 capable
text editor, though of course one that understands JSON is recommended. Most
JSON APIs provide strings directly in UTF-8, so that content can be passed to
the UTF-8 to locale encoder / decoder. CBOR also stores text strings as UTF-8,
so the same ease of encoding and decoding to the locale is there.

Option 2a) File names are binary data, so they MUST NOT be stored as-is in
JSON strings. I recommend either base64 in a string or an array of numbers.
For this, a binary solution is better: CBOR has a type called "byte string",
which can store binary data.
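
The two JSON encodings recommended above for option 2a can be sketched as
follows, in Python for brevity. The "path" key is hypothetical and is not taken
from P1689.

```python
import base64
import json

# A file name that is not valid UTF-8, so it cannot be a JSON String as-is.
name = b"caf\xe9.cpp"

# Variant 1: Base64 text inside a JSON String.
as_base64 = json.dumps({"path": base64.b64encode(name).decode("ascii")})

# Variant 2: an array of JSON numbers, one per byte.
as_numbers = json.dumps({"path": list(name)})

# Both variants give back the exact original bytes.
from_base64 = base64.b64decode(json.loads(as_base64)["path"])
from_numbers = bytes(json.loads(as_numbers)["path"])
```

Both documents are plain ASCII, so they are trivially valid UTF-8 regardless of
what bytes the file name contains.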

Option 2b) is an extension of 2a. You store the payload the same way, except
that you must also store a tag indicating whether the data was 8 or 16-bit. If
using Base64, it must also indicate whether it's big-endian or little (this
problem does not exist for an array of numbers). The same constraints apply to
CBOR and I do not recommend storing as an array of numbers as that will double
the space necessary to store compared to a byte string and will be slower to
encode and decode.
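
To show why the byte order has to be recorded when 16-bit data goes through
Base64, here is a minimal Python sketch. The "width", "byteorder" and "data"
keys and both function names are hypothetical.

```python
import base64
import json
import struct


def _fmt(byteorder, count):
    """Build a struct format for `count` unsigned 16-bit values."""
    prefix = "<" if byteorder == "little" else ">"
    return "%s%dH" % (prefix, count)


def pack_units16(units, byteorder="little"):
    """Wrap 16-bit units as a tagged, Base64-encoded JSON-ready entry."""
    raw = struct.pack(_fmt(byteorder, len(units)), *units)
    return {
        "width": 16,
        "byteorder": byteorder,
        "data": base64.b64encode(raw).decode("ascii"),
    }


def unpack_units16(entry):
    """Recover the 16-bit units, honoring the recorded byte order."""
    raw = base64.b64decode(entry["data"])
    return list(struct.unpack(_fmt(entry["byteorder"], len(raw) // 2), raw))


units = [0x0061, 0xD800, 0x4E2D]
little = json.loads(json.dumps(pack_units16(units, "little")))
big = json.loads(json.dumps(pack_units16(units, "big")))
```

The two entries carry different "data" strings for the same file name, and
reading one with the other's byte order yields different units, so the tag
cannot be omitted.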

This is it. I know this is a long email, but hopefully it helps you come to
some conclusions.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-05 06:13:07