sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Fri, 6 Sep 2019 16:17:27 +0100

>> My reading of their paper was that they want to encode non-UTF8
>> sequences into UTF8 paths in a JSON file. I don't think they should take
>> that path, because it loses too much information.
>
> We'll have to clarify the wording. We propose two allowed representations:
>
> - An array of integers tagged with the corresponding size of values in memory.
> This can represent an arbitrary binary sequence and is the general form.
>
> This variant also allows a "readable-name" field intended only for human
> consumption that is not meant for use in accessing the filesystem.
> It is optional and superfluous for tooling but useful for debugging.
>
> - UTF-8. This is allowed *only if a lossless round trip* is possible
> between the filesystem's native binary sequence and UTF-8. E.g. on
> Windows we should not have to require the full general format to represent
> a simple path like "a.cxx" just because the filesystem APIs use wide chars.
>
> This is intended for the common use case of ASCII-only file paths to make
> the format simpler and more human readable (e.g. for debugging). We then
> generalize beyond ASCII to allow any lossless UTF-8 round-trip (implying
> that the locale does not change).

Thanks for the clarification.

Firstly, you appear to assume that there is only one possible native
filesystem encoding. This is incorrect. It is permitted to vary per
program invocation, and different build tools may experience different
native filesystem encodings.

Indeed, it may even be permitted by the standard to vary as the program
executes, but I am not an expert on that topic. Certainly it can vary as
the program executes on Windows, but that might not be standards
conforming. Somebody expert on SG16 may confirm one way or another.

In any case, your paper needs to account for the native filesystem
encoding being runtime reconfigurable. A tool working with your JSON
files may experience multiple native filesystem encodings, changing at
various points during its execution. It would thus need to write out not
only the raw bytes, but also what encoding it thinks those bytes are in,
i.e. what the currently set locale says they are in. Otherwise the same
program, when reading back JSON that it wrote earlier, may not
understand its own raw bytes.
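
A minimal sketch of that idea in Python, for illustration only. The field
names "encoding" and "bytes" and both helper functions are hypothetical;
they are not taken from P1689. The point is only that the writer's belief
about the encoding travels with the raw bytes:

```python
import json

def encode_path_record(raw, encoding):
    """Serialise raw path bytes plus the encoding the writer believed applied."""
    record = {
        "encoding": encoding,  # hypothetical field: what the locale said at write time
        "bytes": list(raw),    # raw path bytes as an array of integers
    }
    return json.dumps(record)

def decode_path_record(text):
    """Return (raw bytes, recorded encoding name) from a serialised record."""
    record = json.loads(text)
    return bytes(record["bytes"]), record["encoding"]

# The writer saw these bytes while its locale said Latin-1.
raw = "é.cxx".encode("latin-1")
written = encode_path_record(raw, "latin-1")

# A later reader, possibly under a different locale, recovers both pieces.
read_raw, read_encoding = decode_path_record(written)
as_written = read_raw.decode(read_encoding)  # interpreted as the writer intended
as_cp437 = read_raw.decode("cp437")          # same bytes, different text
```

Without the recorded encoding, the reader could only guess between
interpretations such as `as_written` and `as_cp437`.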

Secondly, I want to make sure that your "lossless round trip" means a
very specific thing: the conversion of the filesystem native encoding to
UTF8, and back to the filesystem native encoding, and then a *byte
comparison* of equivalence (not lexicographic comparison).

This would be successful if and only if the native filesystem encoding
is identical at both ends, and the routines converting between UTF8 and
the native filesystem encoding are identical. I should caution you that
they may not be: for example, the Windows NT kernel has a separate set of
routines to Win32, and both of those are separate from what the C++
standard library ships with. Each has its own quirks and corner cases.
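
The round-trip test described above can be sketched in Python as follows.
The function name is hypothetical, and Python's built-in codecs stand in
for whatever conversion routines a real tool would use, which is exactly
the part that may differ between the NT kernel, Win32 and a C++ standard
library:

```python
def round_trips_losslessly(native, native_encoding):
    """True if native -> UTF-8 -> native reproduces the input byte for byte."""
    try:
        text = native.decode(native_encoding)               # native bytes -> text
        utf8 = text.encode("utf-8")                         # text -> UTF-8
        back = utf8.decode("utf-8").encode(native_encoding) # UTF-8 -> native bytes
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False  # no valid conversion in at least one direction
    return back == native  # byte comparison, not lexicographic comparison

# Plain ASCII name, as stored in a wide-char (UTF-16LE) filesystem API.
simple = round_trips_losslessly("a.cxx".encode("utf-16-le"), "utf-16-le")

# A lone high surrogate followed by 'a': representable as a sequence of
# 16-bit units, but not valid UTF-16, so it cannot pass through UTF-8.
lone_surrogate = round_trips_losslessly(b"\x00\xd8a\x00", "utf-16-le")
```

Under these assumptions `simple` is true and `lone_surrogate` is false, so
only the first path would qualify for the UTF-8 representation.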

Finally, something which you didn't address at all in your paper is
filesystem path lookup realms. On Microsoft Windows, each subsystem has
its own filesystem path lookup realm, each with its own rules regarding
comparison and interpretation of paths. However, all paths in all
subsystems do map onto one another, though the mapping may not be
one-to-one.

In other words, a path in one realm may be meaningless in another realm,
or worse, map to a different file than intended. So you may wish to
define the lookup realm for filesystem paths in your JSON file, and
certainly list those for Windows. Off the top of my head:

- DOS 8.3
- DOS Win95 extended paths (close enough to Win32 ANSI)
- Win32 wchar_t, NT semantics
- Win32 wchar_t, POSIX semantics
- NT kernel byte array, NT semantics
- NT kernel byte array, POSIX semantics
- 128 bit GUID

Most reading this post will consider that way overkill. However, if we
do get POSIX to adopt O_BINARYPATH, then an alternative filesystem path
lookup realm on POSIX would be some 8, 16, 32 or 64 byte binary array.

In my opinion it would do no harm in your proposal to define a path
lookup realm, and define at least two values in your proposal:

1. posix_root_path i.e. /a/b/c
2. win32_dos_semantics i.e. X:\a\b\c

... and mention that other values might get standardised in the future,
if demand arises.
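
A rough sketch of what such a realm tag could look like, in Python. The
"realm" and "path" field names and the helper functions are hypothetical,
and Python's `posixpath` and `ntpath` modules are used only as
approximations of the two sets of lookup rules, not as the real platform
behaviour:

```python
import json
import ntpath
import posixpath
from enum import Enum

class LookupRealm(Enum):
    POSIX_ROOT_PATH = "posix_root_path"          # e.g. /a/b/c
    WIN32_DOS_SEMANTICS = "win32_dos_semantics"  # e.g. X:\a\b\c

def make_record(path, realm):
    """Serialise a path together with the realm it should be looked up in."""
    return json.dumps({"realm": realm.value, "path": path})

def split_components(text):
    """Split a recorded path into (directory, leaf) using its realm's rules."""
    record = json.loads(text)
    realm = LookupRealm(record["realm"])  # raises ValueError for an unknown realm
    rules = posixpath if realm is LookupRealm.POSIX_ROOT_PATH else ntpath
    return rules.split(record["path"])

# The same string names different things depending on the realm:
# a backslash is an ordinary filename character under POSIX rules,
# but a directory separator under Win32 DOS rules.
same_string = "/a/b\\c"
posix_view = split_components(make_record(same_string, LookupRealm.POSIX_ROOT_PATH))
win32_view = split_components(make_record(same_string, LookupRealm.WIN32_DOS_SEMANTICS))
```

Here `posix_view` has the leaf `b\c` inside directory `/a`, while
`win32_view` has the leaf `c` inside directory `/a/b`, which is the kind of
silent remapping that an explicit realm field would guard against.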

(If P1031 LLFIO gets chosen by WG21, we would add quite a few from my
list above, as LLFIO supports all those apart from DOS 8.3)

Niall

Received on 2019-09-06 17:17:30