Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path_view

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Mon, 8 Jul 2019 13:38:00 +0200
On Mon, Jul 8, 2019, 12:26 Niall Douglas <s_sourceforge_at_[hidden]> wrote:

> > To evaluate this, it would be important to state what the semantics for
> > bytes are on Windows. Interpreting them according to the “ANSI” code
> > page of the process would be traditional but does not allow addressing
> > all files and goes directly against the motivation stated.
>
> Path view doesn't specify what consumers do with the path view data, but
> P1031 LLFIO currently always does this on Microsoft Windows:
>
> 1. Byte input => Passthrough bytes untouched.
>

On Windows, leaving the bytes “untouched” doesn’t really leave them
untouched: the FooA() APIs convert to UTF-16 before the data reaches the
kernel, so pretending at the C++ standard library layer that this does not
happen is not useful and fails the stated goal of being able to represent
file paths that don’t constitute a sequence of Unicode scalar values.

Treating the byte input as WTF-8, converting it to UTF-16, and then
submitting the resulting 16-bit code units would allow addressing all NT
file paths with 8-bit code units.
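
To make the conversion concrete, here is a minimal C++17 sketch of the
WTF-8 to UTF-16 direction. The name wtf8_to_utf16 is made up for
illustration; it is not part of P1031, LLFIO, or any standard library. It
follows the decoding rules as I read them in the WTF-8 spec and simply
returns std::nullopt for input that is not WTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not an existing API):
// decodes WTF-8 bytes into potentially ill-formed UTF-16 code units.
// Returns std::nullopt if the input is not well-formed WTF-8.
std::optional<std::u16string> wtf8_to_utf16(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // code point being assembled
        std::size_t len = 0;       // length of this sequence in bytes
        std::uint32_t min_cp = 0;  // smallest value allowed (rejects overlongs)
        if (b0 < 0x80)                { cp = b0;        len = 1; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead

        if (in.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF) return std::nullopt;
        i += len;

        if (cp >= 0x10000) {
            // Supplementary code point: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // WTF-8 allows lone surrogates, but a lead surrogate followed by
            // a trail surrogate must have been written as one 4-byte sequence.
            const bool is_trail = cp >= 0xDC00 && cp <= 0xDFFF;
            if (is_trail && !out.empty() &&
                out.back() >= 0xD800 && out.back() <= 0xDBFF) {
                return std::nullopt;
            }
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

The resulting code units could then be passed to a W-suffixed Win32
function such as CreateFileW(); that last step is not shown here.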

> 2. UTF-8 input => to UTF-16 conversion => Submit bytes.
>
> 3. UTF-16 input => Passthrough bytes untouched.
>
> P1031 LLFIO does not use the ANSI Windows APIs, at all ever.
>
>
> For completeness, this is what P1031 LLFIO does on POSIX:
>
> 1. Byte input => Passthrough bytes untouched.
>

WTF-8 on Windows would be the best match for this behavior from the
portability perspective.

> 2. UTF-8 input => Passthrough bytes untouched.
>
> 3. UTF-16 input => to UTF-8 conversion => Submit bytes.
>
> In other words, the UTF-8/UTF-16 encoding is EXCLUSIVELY user side only.
> It is there merely for C++ code portability. It does not provide --
> because it cannot -- any form of portability once the bunch of bytes
> reach the OS kernel.
>

But you don’t get to submit bytes to the NT kernel. You get to submit
unsigned 16-bit code units.

WTF-8 is the best way to let the application be written as if the NT kernel
took bytes.

> > I encourage the committee to look at supporting WTF-8
> > (https://simonsapin.github.io/wtf-8/) as an 8-bit-code-unit encoding
> > that
> > 1) Allows addressing all NT file paths
> > 2) Is equivalent to UTF-8 for those NT file paths that have a textual
> > interpretation.
>
> I must reiterate, once again, that filesystem paths are primarily
> matched by memcmp() on Microsoft Windows, and only if that does not
> match does a non-bits match OPTIONALLY occur.
>

The input to the matching operation is 16-bit units, though.


> As I already explained, different parts of a path may have different
> matching algorithms, because each directory on Microsoft Windows can
> specify how it is to be matched if exact match failed.
>
> Depending on those settings, UCS-16, UTF-16, or something may be used,
> per path item. This is TOTALLY outside user space control.
>
> WTF-8 is useful in many parts of Microsoft Windows, but for filesystem
> paths I find it of very limited utility.
>

That’s an odd position, considering that handling file paths on modern
(NT-kernel-era) Windows via the Win32 API is a motivating use case for the
design of WTF-8, and that there are other programming language standard
libraries, plural, that use it for exactly this Windows path purpose. See
https://simonsapin.github.io/wtf-8/#implementations

> I am unaware of any valid
> use case for anything but a bunch of bits as filesystem path identifiers
> on BOTH POSIX and Windows. Neither lets you inform the OS of encoding
> per path, therefore these are bunches of bits.
>

On Windows, the bits are sequences of 16-bit units. On POSIX they are
sequences of 8-bit units. On Windows, if the 16-bit units form a valid
UTF-16 sequence, the path has a textual interpretation. On modern* POSIXish
systems, if the 8-bit units form a valid UTF-8 sequence, the path has a
textual interpretation. WTF-8 allows one to write application code on
Windows as if the paths had the modern POSIX properties. This is hugely
valuable for writing portable application code that can still address file
paths that don’t have a textual interpretation.

* Configuring a POSIXish system in a different way counts explicitly as not
modern for the purpose of this email.
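
For the direction from the NT side to the application, a minimal C++17
sketch of what I mean follows. The names utf16_to_wtf8 and
is_well_formed_utf16 are hypothetical and exist only for this email. The
first encodes arbitrary 16-bit units as WTF-8, so that paths with a textual
interpretation come out as ordinary UTF-8; the second checks whether a
sequence of 16-bit units has a textual interpretation at all.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helpers, named here for illustration only.
bool is_lead(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
bool is_trail(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Encodes potentially ill-formed UTF-16 as WTF-8. Surrogate pairs become
// one 4-byte sequence; unpaired surrogates become a 3-byte sequence.
std::string utf16_to_wtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (is_lead(in[i]) && i + 1 < in.size() && is_trail(in[i + 1])) {
            // Combine a valid pair into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // Includes unpaired surrogates, which plain UTF-8 would reject.
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// True if the 16-bit units form valid UTF-16, i.e. every surrogate is
// part of a lead/trail pair and the path has a textual interpretation.
bool is_well_formed_utf16(std::u16string_view in) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_trail(in[i])) return false;  // trail without a lead
        if (is_lead(in[i])) {
            if (i + 1 >= in.size() || !is_trail(in[i + 1])) return false;
            ++i;  // skip the trail unit of the pair
        }
    }
    return true;
}
```

When is_well_formed_utf16() holds for the input, the bytes produced by
utf16_to_wtf8() are the same as a regular UTF-8 encoding of the path.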

Received on 2019-07-08 13:38:16