sg16: Re: [SG16-Unicode] Need a char8_t implementation for filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Mon, 19 Aug 2019 11:23:12 +0100

> The discussion highlighted what may have been a misunderstanding on my
> part. I had been viewing the std::byte interface as intending to allow
> programmers to explicitly specify path names that would be stored
> exactly with the name provided on the filesystem. Others had what, to
> me, is a much more reasonable perspective; that the std::byte interface
> exists as a means to pass back to the OS a path that was previously
> provided by the OS (as opposed to something arbitrary constructed by a
> programmer). In other words, the std::byte interface accepts a pointer
> to the underlying representation of an opaque structure (e.g., a
> sequence of bytes on Linux, a sequence of wchar_t on Windows).
>
> If this perspective better aligns with your intents, then I'm somewhat
> more on board with it, though I think std::byte is too generic an
> abstraction. What about defining the interface in terms of a
> `raw_path[_view]` type that is an opaque implementation defined type? I
> think this would help to avoid incorrect use or abuse of the std::byte
> interface, particularly on systems where path names are just a sequence
> of bytes. It might make sense for the `raw_path[_view]` type to be
> constructible from simple inputs (e.g., sequences of `char` on Linux and
> `wchar_t` on Windows).

I found the minutes, particularly the differing understandings of design
rationale, useful.

But ultimately most of those understandings are inaccurate. path_view
ended up with the alt-filesystem::path design which came second to what
was chosen in the Boost peer review. That wasn't intentional: I tried
two other designs beforehand, but it turns out that the second-choice
design, narrowly defeated at the time of the peer review, was the right
one.

The key point being missed is that path_view encapsulates runtime
polymorphism at the C++ level, because that is reality on the ground.
The source of the view can be of format A, B or C. The consumer of the view can
require A, B or C. Translations to go from whatever the source is, to
whatever the consumer says is required for the syscall, are performed as
needed, **at the C++ level**, wrapping the syscall which consumes or
returns a filesystem path.

This runtime polymorphism allows implementation code to emit filesystem
paths without dynamic memory allocation, copying of memory, or encoding
translation. You tag what format the source of the view is, and emit it.
You thus push when to perform dynamic memory allocation, copying of
memory, and encoding translation onto the consumer of the path.

The consumer -- which the emitter has no coupling with -- may be able to
accept multiple forms of input, or just a single form of input. If the
source format is identical to the consumer's format, the original source
is used as-is. Otherwise a just-in-time mapping of source to consumer
format is performed around the syscall which consumes the filesystem path.
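
To make the emit-then-translate-late idea concrete, here is a minimal
sketch. It is not LLFIO's actual path_view: the names sketch_path_view,
utf16_to_utf8 and as_narrow are hypothetical, only two source formats are
modelled, and the consumer is assumed to want narrow UTF-8 bytes. The
emitter just tags its source; the consumer either uses the source as-is or
translates it into its own scratch buffer right before the syscall.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical stand-in for a path view. It records which format the source
// is in and refers to the caller's memory; it never copies or converts.
struct sketch_path_view {
    std::variant<std::string_view, std::u16string_view> source;
};

// Minimal UTF-16 to UTF-8 conversion for the sketch. Assumes well-formed
// input; an unpaired surrogate is encoded as-is rather than rejected.
inline std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Hypothetical consumer that requires narrow bytes for its syscall.
// If the source is already narrow, the original memory is returned
// untouched. Otherwise the translation is written into `scratch`, which
// the consumer owns, and a view of `scratch` is returned.
inline std::string_view as_narrow(const sketch_path_view& v,
                                  std::string& scratch) {
    if (const auto* narrow = std::get_if<std::string_view>(&v.source)) {
        return *narrow;  // formats match: no allocation, copy or translation
    }
    scratch = utf16_to_utf8(std::get<std::u16string_view>(v.source));
    return scratch;
}
```

A consumer wanting 16 bit code units would be the mirror image: pass
through the UTF-16 alternative and translate the narrow one. The emitter's
code does not change either way, which is the decoupling described above.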

And that's basically it, in terms of design rationale.

Let me reemphasise at this point how much is determined by the consumer
of filesystem paths, specifically that supplying filesystem paths to a
ZIP archive library might have very different consumer-determined
behaviours to kernel syscalls.

Let me *doubly* reemphasise that filesystem::path's whole notion of
there being a single "native encoding" for paths is fundamentally WRONG.
In LLFIO, paths are interpreted on Windows as char or wchar_t based on
RUNTIME factors. Even on Linux, if you used a path_view with Java JNI,
the consumer would expect 16 bit codepoints.

The truth is that filesystem paths have multiple "native" encodings
depending on who and what consumes them. Thus, filesystem::path is
inappropriate for such use cases, whereas path_view is intended to fill
that gap.

Regarding raw_path, personally speaking I am unconvinced of the value of
having dual API overloads for every filesystem path consuming function,
not least because a raw_path could be fed to char or wchar_t functions
depending on runtime factors, which then means implementation detail
must be leaked out so the correct raw_path can be chosen. That seems
self-defeating to me.

I think that raw byte array path sources are an escape hatch, one not
expected to be used in ordinary use cases, but important for the
situations where the programmer knows the implementation details of the
consumer of the path view, and very strongly wants to avoid inefficiency
(note that vast reams of code may be elided under optimisation when the
compiler is guaranteed that input is never to be translated). I'd have
no objection to moving the byte*+size_t constructor out of path_view into
a raw_path() free function, which would more strongly indicate its
escape hatch status. Would that work for you, Tom?
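
As a rough illustration of what such a free function could look like (the
signature and the raw_path_view and pass_through names are assumptions made
up for this sketch, not an existing LLFIO or standard interface): the
caller promises the bytes are already in the consumer's format, and a
consumer taking narrow bytes forwards them untranslated.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical: bytes the caller asserts are already in the exact
// representation the consumer will hand to the syscall.
struct raw_path_view {
    const std::byte* data;
    std::size_t size;
};

// Hypothetical free function replacing a byte*+size_t constructor. Having
// to spell out raw_path(...) at the call site marks the escape hatch.
inline raw_path_view raw_path(const std::byte* data,
                              std::size_t size) noexcept {
    assert(data != nullptr || size == 0);
    return raw_path_view{data, size};
}

// Hypothetical consumer taking narrow bytes: no validation, no copy, no
// encoding translation. The bytes are reinterpreted and passed straight on.
inline std::string_view pass_through(raw_path_view v) noexcept {
    return std::string_view(reinterpret_cast<const char*>(v.data), v.size);
}
```

Note that nothing here checks the bytes are valid in any encoding, which
is the point: on a system where path names are just byte sequences, a name
previously returned by the OS round-trips exactly.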

Niall

Received on 2019-08-19 12:23:17