sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Mon, 8 Jul 2019 11:11:20 +0200

On Wed, Jul 3, 2019, 17:11 Niall Douglas <s_sourceforge_at_[hidden]> wrote:

> To explain the dependency, path_view really wants to avoid ever copying
> memory.

The paper shows
bool _buffer_needs_freeing;

I find it surprising that a type named foo_view is more like a copy-on-write
type than a pure view.

> means we cannot use the same API as filesystem::path. So,
> if on Windows you write:
>
> path_view(u16"c:/some/path");
>
> ... then the literal string gets passed directly through to the wide
> character Windows syscalls, uncopied. If on POSIX, it would undergo a
> just-in-time downconversion to UTF-8.
>

If whether a copy happens is platform-dependent, it’s pretty unfortunate
that this paper and the file system API don’t adopt WTF-8 and put the
conversion on the Windows side. See below.

> This works by path_view inspecting the char literal passed to it for its
> type. It refuses to accept `char`, it will only accept byte [1],
> char8_t, char16_t and char32_t. In other words, you must explicitly
> specify the UTF encoding for path view input.
>
> Is SG16 happy for LEWG to review P1030?
>
> Niall
>
> [1]: Raw bytes are accepted because most filesystems will actually
> accept raw binary bytes (minus a few reserved values such as '/' or 0)
> as path identifiers without issue. There is no interpretation done based
> on Unicode. Thus, not supporting raw bytes makes many valid filenames
> impossible to work with.
>
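
For readers who have not opened the paper, the "refuses `char`" rule quoted
above could be sketched roughly as follows. This is only an illustration under
my own assumptions, not P1030's actual interface: the names sketch_path_view
and is_explicit_path_unit are hypothetical, and char8_t is enabled only where
the compiler provides it.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical trait: true only for code unit types whose encoding is
// explicit (or which are explicitly raw bytes). Plain `char` is left out.
template <class T>
struct is_explicit_path_unit : std::false_type {};
template <> struct is_explicit_path_unit<std::byte> : std::true_type {};
template <> struct is_explicit_path_unit<char16_t> : std::true_type {};
template <> struct is_explicit_path_unit<char32_t> : std::true_type {};
#ifdef __cpp_char8_t
template <> struct is_explicit_path_unit<char8_t> : std::true_type {};
#endif

// Hypothetical non-owning view: it stores a pointer and a byte count and
// never copies the input.
class sketch_path_view {
public:
    // The constructor is removed from overload resolution for `char`.
    template <class CharT,
              std::enable_if_t<is_explicit_path_unit<CharT>::value, int> = 0>
    sketch_path_view(const CharT* p, std::size_t n) noexcept
        : data_(p), size_bytes_(n * sizeof(CharT)) {}

    const void* data() const noexcept { return data_; }
    std::size_t size_bytes() const noexcept { return size_bytes_; }

private:
    const void* data_;
    std::size_t size_bytes_;
};
```

With this sketch, sketch_path_view(u"c:/some/path", 12) compiles, while
passing a plain "c:/some/path" does not.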

To evaluate this, it would be important to state what the semantics for
bytes are on Windows. Interpreting them according to the “ANSI” code page
of the process would be traditional but does not allow addressing all files
and goes directly against the motivation stated.

I encourage the committee to look at supporting WTF-8 (
https://simonsapin.github.io/wtf-8/) as an 8-bit-code-unit encoding that
1) Allows addressing all NT file paths
2) Is equivalent to UTF-8 for those NT file paths that have a textual
interpretation.

This allows for portable code (where the platform-dependent conversion goes
on the Windows side unlike in the example above). On POSIX-like platforms,
you’d instead use a sequence of bytes to address all file paths and if the
path happens to be valid UTF-8, the path has a textual interpretation. To
enable even more portable application code, you’d make the file system
library swap \ and / on Windows similar to how : and / are swapped to
expose HFS+ in a POSIX-compatible way.

(The Rust file system API uses WTF-8 on Windows enabling portable
application layer code that can address all file paths in both Windows and
POSIX with cheap conversion to/from UTF-8 strings.)
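
To make the Windows-side conversion concrete, here is a minimal sketch of
encoding potentially ill-formed UTF-16 (which is what NT paths are) as WTF-8,
following my reading of the WTF-8 document linked above. The function name
utf16_to_wtf8 is hypothetical, and this is not taken from P1030 or from any
standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode potentially ill-formed UTF-16 as WTF-8.
// Well-formed surrogate pairs become 4-byte sequences exactly as in UTF-8;
// unpaired surrogates are kept as 3-byte sequences, so every NT path survives.
inline std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case: 3 bytes per 16-bit code unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a lead surrogate with a following trail surrogate.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the trail surrogate has been consumed
            }
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // This branch also covers unpaired surrogates (U+D800..U+DFFF).
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For a path that is valid UTF-16 the output is plain UTF-8, which is what
makes the conversion to and from UTF-8 strings cheap.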


Received on 2019-07-08 12:41:34