sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Wed, 3 Jul 2019 18:10:59 +0100

On 03/07/2019 17:07, Tom Honermann wrote:
> On 7/3/19 11:11 AM, Niall Douglas wrote:
>> Dear SG16,
>>
>> Titus would like to know if SG16 has no objection to LEWG reviewing
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1030r2.pdf
>> "std::filesystem::path_view" at the Cologne meeting? SG16 approval is
>> sought because P1030 depends on P0482 char8_t.
>
> Disclaimer: I have not yet read P1030R2, though I have read the prior
> revisions.
>
> SG16 reviewed prior revisions of P1030 at several of our telecons.
> Notes on those reviews are available at the following links (search for
> 'path_view'):
>
> * https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018
> * https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#may-30th-2018
>
> The only poll we conducted was during the May 30th, 2018 meeting. Polls
> at our telecons are informal.

FYI I was completely unaware that SG16 had discussed P1030. Nobody told me.

Most of the reforms in R1 are due to Billy O'Neal, who objected to R0
wearing his standard library implementer's hat. He seems to be happy
with R1.

> I would like for SG16 to re-review P1030 at Cologne and take some
> official polls. I added it to our agenda at
> http://wiki.edg.com/bin/view/Wg21cologne2019/SG16.
>
> Niall, will you be present in Cologne?

Yes I will.

> Note that the SG16 Unicode Direction paper was updated for the
> pre-meeting mailing with relevant discussion concerning encoding of file
> names. Please take a look at P1238R1
> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html)
> sections 2.7 and 6.3 and provide any feedback.

It is mostly accurate. Some additional data points are that the NT
kernel permits both NUL and backslashes in filenames. So, if P1031 LLFIO
is in NT kernel path mode, you can literally dump random bytes into a
filename, and it just works (though it may crash Windows Explorer, and
nothing on Windows can then open or delete the file entry).

For Apple filing systems, I was under the impression that the
implementation tries straight memcmp() first. Only if memcmp() fails
does it fall back onto Unicode-based filesystem entry matching. So, like
on POSIX, if you feed binary bytes as a filesystem entry (with 0 and '/'
removed), that "just works". In other words, your mention of Unicode
matching ought to be described as a fallback mechanism to bitwise
matching, which only occurs if memcmp() fails.
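
To make the "bitwise first, fallback second" ordering concrete, here is
a minimal sketch. It is not how any real kernel implements matching:
names_match, loose_match and ascii_lower are hypothetical names, and
ASCII-only case folding stands in for real Unicode-aware matching.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper: fold A-Z to a-z, leave every other byte alone.
inline unsigned char ascii_lower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c - 'A' + 'a') : c;
}

// Slow path. ASSUMPTION: ASCII case folding is only a stand-in for the
// Unicode-based matching a real system would perform here.
inline bool loose_match(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_lower(static_cast<unsigned char>(a[i])) !=
            ascii_lower(static_cast<unsigned char>(b[i]))) {
            return false;
        }
    }
    return true;
}

// Fast path first: a plain bitwise comparison. Only when that fails do
// we pay for the slower, interpretation-based comparison.
inline bool names_match(std::string_view a, std::string_view b) {
    if (a.size() == b.size() &&
        (a.empty() || std::memcmp(a.data(), b.data(), a.size()) == 0)) {
        return true;
    }
    return loose_match(a, b);
}
```

With this ordering, a name made of arbitrary bytes still matches itself
on the fast path without ever being interpreted as text.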

Finally, I believe all major OS kernels no longer have the filesystem
itself match filenames using anything other than memcmp(). This is
because Unicode comparison is slow, and it murders the performance of
some file system algorithms which (for example) look for a marker file
in many directories. Unicode matching is therefore best done as an
opt-in layer on top of the filesystem implementation.

>> To explain the dependency, path_view really wants to avoid ever copying
>> memory. This means we cannot use the same API as filesystem::path. So,
>> if on Windows you write:
>>
>> path_view(u"c:/some/path");
>>
>> ... then the literal string gets passed directly through to the wide
>> character Windows syscalls, uncopied.
>
> Presumably via a reinterpret_cast somewhere along the line? That would
> be UB outside the standard library.

Standard library implementations can do what they like.

>> If on POSIX, it would undergo a
>> just-in-time downconversion to UTF-8.
>
> This seems to (necessarily) violate the "avoid ever copying memory"
> requirement. It also assumes an encoding of filenames that should be
> left up to the implementation (I would be opposed to specifying a
> conversion to UTF-8 here; that would be inconsistent with
> std::filesystem::path).

The aim is for path_view to be usually no worse than path, nothing more.
If the input is in UTF-8, and the system API requires UTF-16, then you
need to convert, same as for path. Unless you want to push mandatory
#ifdef-ing onto the end user, which I don't think we want.

Users who really care will take the effort to #ifdef, and they'll see
the corresponding performance gains. I think that's a good balance
between convenience and performance.
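
As a rough illustration of "convert only when the encodings differ",
here is a sketch for a system whose native form is narrow bytes, as on
POSIX. It is not the P1030 design: to_native, utf16_to_utf8 and
append_utf8 are hypothetical names, std::string_view stands in for
"already native" input so the sketch stays within C++17, and replacing
unpaired surrogates with U+FFFD is a policy assumed purely for the
example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Encode one code point as UTF-8 and append it to `out`.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// UTF-16 to UTF-8. ASSUMPTION: unpaired surrogates become U+FFFD.
inline std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                   // consumed the low surrogate
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}

// Input already in the native narrow form: returned untouched, no copy.
inline std::string_view to_native(std::string_view bytes, std::string& /*scratch*/) {
    return bytes;
}

// UTF-16 input: converted just in time into caller-provided scratch space.
inline std::string_view to_native(std::u16string_view utf16, std::string& scratch) {
    scratch = utf16_to_utf8(utf16);
    return scratch;
}
```

The caller owns the scratch string, so the returned view stays valid for
as long as the caller keeps that string alive and unmodified.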

Finally, the really big gain with path_view is in path manipulation and
directory enumeration. Common path manipulation sequences go orders of
magnitude faster when no dynamic memory allocations and frees occur for
each individual step. filesystem::directory_iterator is widely
recognised to not be fit for purpose (see much lamenting on
stackoverflow). llfio::directory_handle::read() will enumerate 10M item
directories without issue. The main reason why is path_view: enumeration
returns path_views of the system's native path format. This avoids 10M
dynamic memory allocations. In fact, llfio::directory_handle::read()
doesn't allocate any memory at all, it just asks the kernel to blat into
a buffer, and we set up the path views to point into that kernel-filled
buffer.
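
The idea of views pointing into one filled buffer can be sketched as
follows. This is not LLFIO's implementation: fill_entry_views is a
hypothetical name, and the buffer layout (consecutive NUL-terminated
names, with an empty name ending the list) is invented for the example
rather than being what any kernel actually returns.

```cpp
#include <cstddef>
#include <string_view>

// Walk a buffer of consecutive NUL-terminated names and store a view of
// each name into `out`. No name is copied and nothing is allocated; the
// views stay valid only while `buffer` does.
// ASSUMPTION: this layout is made up for illustration.
inline std::size_t fill_entry_views(const char* buffer, std::size_t buffer_size,
                                    std::string_view* out, std::size_t out_capacity) {
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos < buffer_size && count < out_capacity) {
        std::string_view rest(buffer + pos, buffer_size - pos);
        const std::size_t end = rest.find('\0');
        if (end == std::string_view::npos) break;  // truncated last entry: ignore it
        if (end == 0) break;                       // empty name ends the list
        out[count++] = rest.substr(0, end);        // view into the buffer itself
        pos += end + 1;                            // skip the name and its NUL
    }
    return count;
}
```

The caller supplies both the buffer and the array of views, so the whole
enumeration step runs without touching the heap.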

So, tl;dr: unlike string_view, path_view isn't a replacement for path,
but rather an adjunct to it. LLFIO uses path or path_view wherever each
is the most appropriate. You should consider path_view in that light, as
a path-extender, not path-replacer.

>> This works by path_view inspecting the char literal passed to it for its
>> type. It refuses to accept `char`, it will only accept byte [1],
>> char8_t, char16_t and char32_t. In other words, you must explicitly
>> specify the UTF encoding for path view input.
>
> I personally don't feel it is necessary to refuse 'char', though I
> understand the motivation. The inconsistency with std::filesystem::path
> is more concerning to me than the use of 'char' to mean "bytes". This
> is something worth polling in SG16.

After lengthy discussion with Billy, I personally think that path ought
to refuse char input in future C++ standards. Force people to specify
what they actually mean. Tons of path-based code out there is subtly
broken due to use of char, particularly on Windows, where char input to
path doesn't at all mean what it means on POSIX.

I should add that the killing off of char input was strongly requested
by Billy. I got the feeling it was a red line for him. I can understand
why, from an MSVC-implementer perspective, and I have witnessed
first-hand the brokenness of char input to path on Windows.
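
For readers who want to see what "refusing char" can look like at
compile time, here is a minimal sketch. It is not the P1030 interface:
toy_path_view and is_explicit_path_char_v are hypothetical names, and
char8_t is only accepted when the compiler advertises it through the
__cpp_char8_t feature-test macro.

```cpp
#include <cstddef>
#include <type_traits>

// True only for character types that state their encoding explicitly
// (or std::byte for raw bytes). Plain `char` is deliberately absent.
template <class T>
inline constexpr bool is_explicit_path_char_v =
    std::is_same_v<T, std::byte> ||
#ifdef __cpp_char8_t
    std::is_same_v<T, char8_t> ||
#endif
    std::is_same_v<T, char16_t> || std::is_same_v<T, char32_t>;

// Hypothetical stand-in for a path view. It never copies its input and
// only records where the input is and how it is encoded.
class toy_path_view {
public:
    enum class encoding { bytes, utf8, utf16, utf32 };

    // Takes part in overload resolution only for explicitly encoded
    // input, so toy_path_view("foo", 3) fails to compile.
    template <class Char, std::enable_if_t<is_explicit_path_char_v<Char>, int> = 0>
    toy_path_view(const Char* data, std::size_t length) noexcept
        : data_(data), length_(length), encoding_(encoding_of<Char>()) {}

    const void* data() const noexcept { return data_; }
    std::size_t size() const noexcept { return length_; }
    encoding source_encoding() const noexcept { return encoding_; }

private:
    template <class Char>
    static constexpr encoding encoding_of() noexcept {
        if constexpr (std::is_same_v<Char, std::byte>) return encoding::bytes;
        else if constexpr (std::is_same_v<Char, char16_t>) return encoding::utf16;
        else if constexpr (std::is_same_v<Char, char32_t>) return encoding::utf32;
        else return encoding::utf8;  // only char8_t can reach this branch
    }

    const void* data_;
    std::size_t length_;
    encoding encoding_;
};
```

Constructing it from u"c:/some/path" records UTF-16 as the source
encoding, while passing a plain char string is rejected by the compiler
rather than being silently misinterpreted.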

>> Is SG16 happy for LEWG to review P1030?
>
> Per P1253
> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1253r0.html),
> there is sufficient justification for SG16 review prior to LEWG review.
> However, I also think there is sufficient content in the proposal that
> does not require review by SG16 that LEWG could make progress reviewing
> those parts while awaiting SG16 review (I suspect LEWG will request
> changes). Ideally, SG16 will review this at the meeting on Wednesday
> and LEWG will review it with SG16 input included Thursday or Friday.
> But that scheduling is, of course, up to the LEWG chair.
>
> Niall, assuming you will be able to attend SG16 in Cologne, please come
> prepared with a specific list of questions/polls you would like feedback
> on. I'll have a list as well (though I don't yet).

I'm not sure if I have any, if I am honest. The SG16 feedback for R0 was
valid, but I think it has all been addressed by R2.

I would welcome any comment, though please do read R2 first, as I think
all the most likely comments are already answered there.

>> [1]: Raw bytes are accepted because most filesystems will actually
>> accept raw binary bytes (minus a few reserved values such as '/' or 0)
>> as path identifiers without issue. There is no interpretation done based
>> on Unicode. Thus, not supporting raw bytes makes many valid filenames
>> impossible to work with.
>
> Agreed, though for some filesystems and filesystem APIs, it is even
> worse (e.g., '\' (0x5C) can appear as a trailing code unit for Shift-JIS
> file names for Windows ANSI file APIs and does not indicate a path
> separator in that case).

We must be careful to distinguish between the filesystem path
translation layer which Win32 implements for backwards compatibility
with Win16, and the actual filesystem path support.

The actual filesystem path support on Windows is a bag of bits with no
interpretation at all, unless you ask for case insensitive filename
comparison (which Win32 does). Then it first tries memcmp(), and then
falls back to locale-aware comparison. Newer NTFS lets you set the
comparison method per directory, so some directories can be case
insensitive, others case sensitive, or else the system default.

Linux intends to add a similar optional layer over the filesystem for
case insensitive path matching. See https://lwn.net/Articles/772960/.

The point that I am making is that Unicode interpretation of filesystem
paths is entirely outside our control. It is whatever the OS has decided
it to be, and moreover it can *vary*, per directory. We supply an array
of bytes; what those bytes mean is OS-determined. Similarly, when
reading paths, they too are an array of bytes of unspecified meaning. We
must be takers on that, and devise some non-crashing method of rendering
such byte strings into readable text (and do better than Windows
Explorer!).
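
One very simple non-crashing rendering is sketched below. It is only an
illustration: render_for_display is a hypothetical name, and escaping
every non-ASCII byte is an assumption made to keep the example short. A
real implementation would presumably try to decode valid UTF-8 first and
escape only what fails to decode.

```cpp
#include <string>
#include <string_view>

// Render arbitrary bytes for display. Printable ASCII passes through;
// every other byte, and the backslash itself, is written as \xNN so the
// output is unambiguous and never depends on the bytes being valid text.
inline std::string render_for_display(std::string_view raw) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(raw.size());
    for (unsigned char c : raw) {
        if (c >= 0x20 && c < 0x7F && c != '\\') {
            out += static_cast<char>(c);
        } else {
            out += "\\x";
            out += hex[c >> 4];   // high nibble
            out += hex[c & 0xF];  // low nibble
        }
    }
    return out;
}
```

Because the backslash is escaped too, two different byte strings never
render to the same text.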

(Incidentally, the proper way to parse path components is to use
llfio::path_handle, and ask it repeatedly for its parent path_handle.
You can ask each of those path_handles for its current_path(), which is
what the kernel says is the handle's current path, and from that
determine which path fragments contain slashes and Shift-JIS bits etc.
It's the only way, I am afraid, of reliably parsing path components.)

This may explain why I don't have any questions at the current time for
SG16. To the best of my knowledge, filesystem paths fall outside our
power to do much about them. Please do correct me if I am wrong on this
interpretation.

Niall

Received on 2019-07-03 19:11:03