sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Mon, 8 Jul 2019 11:26:18 +0100

On 08/07/2019 10:11, Henri Sivonen wrote:
> On Wed, Jul 3, 2019, 17:11 Niall Douglas <s_sourceforge_at_[hidden]
> <mailto:s_sourceforge_at_[hidden]>> wrote:
>
> To explain the dependency, path_view really wants to avoid ever copying
> memory.
>
>
> The paper shows
> bool _buffer_needs_freeing;
>
> I find it surprising that a type name foo_view is more like a
> copy-on-write type than a pure view.

That's in path_view::c_str, not in path_view. Path views are trivial types.

> means we cannot use the same API as filesystem::path. So,
> if on Windows you write:
>
> path_view(u16"c:/some/path");
>
> ... then the literal string gets passed directly through to the wide
> character Windows syscalls, uncopied. If on POSIX, it would undergo a
> just-in-time downconversion to UTF-8.
>
>
> If whether a copy happens is platform-dependent, it’s pretty unfortunate
> that this paper and the file system API don’t adopt WTF-8 and put
> the conversion on the Windows side. See below.

Whether a copy happens depends on whether the input is zero terminated,
and on whether the platform requires zero termination of paths.
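
A minimal sketch of that copy decision, not the P1031 implementation:
view_sketch, c_str_sketch and the zero_terminated flag are hypothetical
names invented for illustration. The view stays a trivial non-owning
type; only the short-lived c_str helper may own a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a path view: a non-owning range plus a flag
// recording whether the source is known to be zero terminated.
struct view_sketch {
    std::string_view data;
    bool zero_terminated;
};

// Hypothetical stand-in for a c_str helper: copies only when the platform
// needs zero termination and the input does not already provide it.
class c_str_sketch {
    std::string storage_;  // used only when a copy was required
    const char* ptr_;
    bool copied_;

public:
    c_str_sketch(view_sketch v, bool platform_needs_zero_termination)
        : ptr_(v.data.data()), copied_(false) {
        if (platform_needs_zero_termination && !v.zero_terminated) {
            storage_.assign(v.data.data(), v.data.size());
            ptr_ = storage_.c_str();
            copied_ = true;
        }
    }
    // ptr_ may point into storage_, so copying is disabled in this sketch.
    c_str_sketch(const c_str_sketch&) = delete;
    c_str_sketch& operator=(const c_str_sketch&) = delete;

    const char* c_str() const { return ptr_; }
    bool copied() const { return copied_; }
};

inline void demo_copy_decision() {
    static const char lit[] = "c:/some/path";
    // Zero-terminated literal: passed through, no copy.
    c_str_sketch whole(view_sketch{std::string_view(lit), true}, true);
    assert(!whole.copied() && whole.c_str() == lit);
    // Sub-range without a terminator: copied so a terminator can be added.
    c_str_sketch prefix(view_sketch{std::string_view(lit, 2), false}, true);
    assert(prefix.copied() && std::string(prefix.c_str()) == "c:");
}
```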

Whether a UTF conversion happens depends on whether the user supplied a
different UTF encoding to the declared platform-specific default encoding.

> [1]: Raw bytes are accepted because most filesystems will actually
> accept raw binary bytes (minus a few reserved values such as '/' or 0)
> as path identifiers without issue. There is no interpretation done based
> on Unicode. Thus, not supporting raw bytes makes many valid filenames
> impossible to work with.
>
>
> To evaluate this, it would be important to state what the semantics for
> bytes are on Windows. Interpreting them according to the “ANSI” code
> page of the process would be traditional but does not allow addressing
> all files and goes directly against the motivation stated.

Path view doesn't specify what consumers do with the path view data, but
P1031 LLFIO currently always does this on Microsoft Windows:

1. Byte input => Passthrough bytes untouched.

2. UTF-8 input => to UTF-16 conversion => Submit bytes.

3. UTF-16 input => Passthrough bytes untouched.

P1031 LLFIO does not use the ANSI Windows APIs at all, ever.

For completeness, this is what P1031 LLFIO does on POSIX:

1. Byte input => Passthrough bytes untouched.

2. UTF-8 input => Passthrough bytes untouched.

3. UTF-16 input => to UTF-8 conversion => Submit bytes.

In other words, the UTF-8/UTF-16 encoding is EXCLUSIVELY user side.
It is there merely for C++ code portability. It does not provide --
because it cannot -- any form of portability once the bunch of bytes
reach the OS kernel.
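
The two lists above can be restated as a small decision function. This
is only a sketch of the behaviour described in this message; the enum
and function names are hypothetical and are not taken from P1031 or
LLFIO.

```cpp
// Hypothetical names, for illustration only.
enum class input_kind { bytes, utf8, utf16 };
enum class platform { windows, posix };
enum class action { passthrough, convert_to_utf16, convert_to_utf8 };

// What happens to the supplied path data before it is submitted to the OS.
constexpr action what_happens(platform p, input_kind in) {
    if (in == input_kind::bytes) {
        return action::passthrough;  // raw bytes are never touched
    }
    if (p == platform::windows) {
        return in == input_kind::utf8 ? action::convert_to_utf16
                                      : action::passthrough;
    }
    return in == input_kind::utf16 ? action::convert_to_utf8
                                   : action::passthrough;
}

static_assert(what_happens(platform::windows, input_kind::utf8) ==
              action::convert_to_utf16, "UTF-8 is upconverted on Windows");
static_assert(what_happens(platform::posix, input_kind::utf16) ==
              action::convert_to_utf8, "UTF-16 is downconverted on POSIX");
```

A conversion only ever happens when the caller supplies a UTF encoding
other than the platform's native one; byte input is passed through on
both platforms.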

> I encourage the committee to look at supporting WTF-8
> (https://simonsapin.github.io/wtf-8/) as an 8-bit-code-unit encoding that
> 1) Allows addressing all NT file paths
> 2) Is equivalent to UTF-8 for those NT file paths that have a textual
> interpretation.

I must reiterate, once again, that filesystem paths are primarily
matched by memcmp() on Microsoft Windows, and only if that does not
match does a non-bits match OPTIONALLY occur.

As I already explained, different parts of a path may have different
matching algorithms, because each directory on Microsoft Windows can
specify how it is to be matched if exact match failed.
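
To illustrate the matching order as I describe it (exact bits first,
then an optional per-directory fallback), here is a user-space sketch.
It is not how any kernel or filesystem driver is implemented, and
names_match and fallback_match are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical per-directory fallback comparison; empty means "none".
using fallback_match = std::function<bool(const void*, std::size_t,
                                          const void*, std::size_t)>;

// Exact bitwise match first; only if that fails is the directory's
// optional fallback consulted.
inline bool names_match(const void* a, std::size_t alen,
                        const void* b, std::size_t blen,
                        const fallback_match& dir_fallback) {
    if (alen == blen && (alen == 0 || std::memcmp(a, b, alen) == 0)) {
        return true;
    }
    return dir_fallback ? dir_fallback(a, alen, b, blen) : false;
}

inline void demo_names_match() {
    const char x[] = "Readme";
    const char y[] = "README";
    // No fallback configured: only identical bits match.
    assert(names_match(x, 6, x, 6, fallback_match{}));
    assert(!names_match(x, 6, y, 6, fallback_match{}));
}
```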

Depending on those settings, UCS-2, UTF-16, or something else may be used,
per path item. This is TOTALLY outside user space control.

WTF-8 is useful in many parts of Microsoft Windows, but for filesystem
paths I find it of very limited utility.

> This allows for portable code (where the platform-dependent conversion
> goes on the Windows side unlike in the example above). On POSIX-like
> platforms, you’d instead use a sequence of bytes to address all file
> paths and if the path happens to be valid UTF-8, the path has a textual
> interpretation. To enable even more portable application code, you’d
> make the file system library swap \ and / on Windows similar to how :
> and / are swapped to expose HFS+ in a POSIX-compatible way.
>
> (The Rust file system API uses WTF-8 on Windows enabling portable
> application layer code that can address all file paths in both Windows
> and POSIX with cheap conversion to/from UTF-8 strings.)

I have no wish to pick a fight (again!) with Rust system library
designers, but some of them lack experience. I am unaware of any valid
use case for anything but a bunch of bits as filesystem path identifiers
on BOTH POSIX and Windows. Neither lets you inform the OS of encoding
per path, therefore these are bunches of bits. You cannot reliably
interpret them as human text, so correct software design does not do that.

I would have no issue with supporting WTF-8 as an alternative encoding
to UTF-8 and UTF-16 in path views, incidentally. But I suspect that
POSIX folk wouldn't value the niche use case (for them), and the value
is very limited when Windows can use arbitrary encoding interpretations
for different parts of the path.

Niall

Received on 2019-07-08 12:26:25