sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 3 Jul 2019 16:35:43 -0400

On 7/3/19 1:10 PM, Niall Douglas wrote:
> On 03/07/2019 17:07, Tom Honermann wrote:
>> On 7/3/19 11:11 AM, Niall Douglas wrote:
>>> Dear SG16,
>>>
>>> Titus would like to know if SG16 has no objection to LEWG reviewing
>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1030r2.pdf
>>> "std::filesystem::path_view" at the Cologne meeting? SG16 approval is
>>> sought because P1030 depends on P0482 char8_t.
>> Disclaimer: I have not yet read P1030R2, though I have read the prior
>> revisions.
>>
>> SG16 reviewed prior revisions of P1030 at several of our telecons.
>> Notes on those reviews are available at the following links (search for
>> 'path_view'):
>>
>> * https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018
>> * https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#may-30th-2018
>>
>> The only poll we conducted was during the May 30th, 2018 meeting. Polls
>> at our telecons are informal.
> FYI I was completely unaware that SG16 had discussed P1030. Nobody told me.
Ah, I'm sorry for that. I thought we had delegated someone to follow up
with you, and from JeanHeyd's response, it sounds like he was that
delegate and did follow up, but that the connection back to SG16 was
lost in translation. I should have followed up with you directly to
ensure that our feedback was received.
>
>> Note that the SG16 Unicode Direction paper was updated for the
>> pre-meeting mailing with relevant discussion concerning encoding of file
>> names. Please take a look at P1238R1
>> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html)
>> sections 2.7 and 6.3 and provide any feedback.
> It is mostly accurate. Some additional data points are that the NT
> kernel permits both NUL and backslashes in filenames. So, if P1031 LLFIO
> is in NT kernel path mode, you can literally dump random bytes into a
> filename, and it just works (though it may crash Windows Explorer, and
> nothing on Windows can either open or delete the file entry).
Interesting, I wasn't aware of that behavior. Is it documented anywhere?
>
> For Apple filing systems, I was under the impression that the
> implementation tries straight memcmp() first. Only if memcmp() fails
> does it fall back onto Unicode-based filesystem entry matching. So, like
> on POSIX, if you feed binary bytes as a filesystem entry (with 0 and '/'
> removed), that "just works". In other words, your mention of Unicode
> matching ought to be described as a fallback mechanism to bitwise
> matching, which only occurs if memcmp() fails.
I had a hard time finding good documentation for Apple's filesystems.
However, I don't think the memcmp() optimization is germane to the
described behavior.
>
> Finally, I believe all major OS kernels have dropped the filesystem
> itself matching filenames using anything other than memcmp(). This is
> because Unicode comparison is slow, and it murders the performance of
> some file system algorithms which (for example) look for a marker file
> in many directories. Unicode matching is therefore best done as an
> opt-in layer on top of the filesystem implementation.
That could be, and the performance implications appear to have been part
of Apple's motivation for migrating from HFS+ to APFS. I suspect OS
kernels still have to maintain interfaces for legacy filesystems though.
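For what it's worth, the layering Niall describes (exact byte match first, folding only as an opt-in fallback) might look roughly like the sketch below. This is only an illustration: names_match and fold_ascii are hypothetical names of mine, and the ASCII-only fold is a stand-in for whatever Unicode or locale-aware comparison a real layer would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// ASCII-only case fold. Assumption: a stand-in for real Unicode or
// locale-aware folding, which is far more involved (and slower).
constexpr unsigned char fold_ascii(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// Hypothetical matcher: bitwise comparison first, folding only on a miss
// and only when the caller opted in.
inline bool names_match(std::string_view stored, std::string_view wanted,
                        bool case_insensitive) {
    // Fast path: identical length and identical bytes.
    if (stored.size() == wanted.size() &&
        (stored.empty() ||
         std::memcmp(stored.data(), wanted.data(), stored.size()) == 0)) {
        return true;
    }
    // With an ASCII-only fold, names of different length can never match.
    if (!case_insensitive || stored.size() != wanted.size()) {
        return false;
    }
    assert(stored.size() == wanted.size());
    for (std::size_t i = 0; i < stored.size(); ++i) {
        if (fold_ascii(static_cast<unsigned char>(stored[i])) !=
            fold_ascii(static_cast<unsigned char>(wanted[i]))) {
            return false;
        }
    }
    return true;
}
```

The common case (a marker file whose name is spelled identically everywhere) never leaves the memcmp() path, which is the performance argument being made.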
>>> If on POSIX, it would undergo a
>>> just-in-time downconversion to UTF-8.
>> This seems to (necessarily) violate the "avoid ever copying memory"
>> requirement. It also assumes an encoding of filenames that should be
>> left up to the implementation (I would be opposed to specifying a
>> conversion to UTF-8 here; that would be inconsistent with
>> std::filesystem::path).
> The aim is for path_view to be usually no worse than path, nothing more.
> If the input is in UTF-8, and the system API requires UTF-16, then you
> need to convert, same as for path. Unless you want to push mandatory
> #ifdef-ing onto the end user, which I don't think we want.
I agree, we don't want that :)
>
> For those users which really care, they'll take the effort to #ifdef,
> and they'll see the corresponding performance gains. I think that's a
> good balance of convenience over performance.
Agreed, though #ifdef isn't necessarily required. The intent in
defining std::filesystem::path::value_type and
std::filesystem::path::string_type is to provide a generic mechanism for
working with the different types. I acknowledge this is a leaky
abstraction and, due to the lack of convenient conversion methods,
cumbersome to use effectively.
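To illustrate both halves of that (my own sketch, not something taken from P1030 or LLFIO): the two helpers below work on the native representation without any #ifdef, but note that I had to hand-roll the literal conversion. The names native_from_ascii and has_native_suffix are hypothetical, and the conversion assumes an ASCII-compatible narrow execution character set, so it would not be correct on z/OS.

```cpp
#include <cassert>
#include <filesystem>
#include <string_view>

namespace fs = std::filesystem;

// Converts a pure-ASCII string to the platform's native path string type.
// value_type is char on POSIX and wchar_t on Windows; for ASCII input a
// plain cast of each code unit is valid in both cases.
inline fs::path::string_type native_from_ascii(std::string_view ascii) {
    fs::path::string_type out;
    out.reserve(ascii.size());
    for (char c : ascii) {
        assert(static_cast<unsigned char>(c) < 0x80 && "input must be ASCII");
        out.push_back(static_cast<fs::path::value_type>(c));
    }
    return out;
}

// Tests whether the native representation of `p` ends with `suffix`,
// comparing native code units and never converting the path itself.
inline bool has_native_suffix(const fs::path& p,
                              const fs::path::string_type& suffix) {
    const fs::path::string_type& native = p.native();
    return native.size() >= suffix.size() &&
           native.compare(native.size() - suffix.size(), suffix.size(),
                          suffix) == 0;
}
```

Something like has_native_suffix(p, native_from_ascii(".txt")) then compiles unchanged on both kinds of platform, at the price of the boilerplate above.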
>
> Finally, the real big gain with path_view is path manipulation and
> directory enumeration. Common path manipulation sequences go orders of
> magnitude faster when no dynamic memory allocations and frees are
> occurring for each individual step. filesystem::directory_iterator is
> widely recognised to not be fit for purpose (see much lamenting on
> stackoverflow). llfio::directory_handle::read() will enumerate 10M item
> directories without issue. The main reason why is path_view: enumeration
> returns path_views of the system's native path format. This avoids 10M
> dynamic memory allocations. In fact, llfio::directory_handle::read()
> doesn't allocate any memory at all, it just asks the kernel to blat into
> a buffer, and we set up the path views to point into that kernel-filled
> buffer.
>
> So, tl;dr: unlike string_view, path_view isn't a replacement for path,
> but rather an adjunct to it. LLFIO uses path or path_view each where it
> is the most appropriate. You should consider path_view in that light, as
> a path-extender, not path-replacer.
Acknowledged. This is more LEWG territory than SG16.
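For readers following along, the zero-allocation idea described above can be sketched as follows. This is a deliberately simplified model and not LLFIO's implementation: I assume a buffer of names laid out back to back and each terminated by '\0' (real kernel directory records carry more fields), I use std::string_view as a stand-in for path_view, and view_entries is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Fills `out` with views into `buffer` and returns how many were written.
// No name is copied and nothing is allocated: each view is just a pointer
// and a length referring to the caller's buffer.
template <std::size_t N>
std::size_t view_entries(std::string_view buffer,
                         std::array<std::string_view, N>& out) {
    std::size_t count = 0;
    while (!buffer.empty() && count < N) {
        const std::size_t end = buffer.find('\0');
        if (end == std::string_view::npos) {
            break;  // incomplete trailing record: stop rather than guess
        }
        out[count++] = buffer.substr(0, end);
        buffer.remove_prefix(end + 1);  // step past the name and its '\0'
    }
    assert(count <= N);
    return count;
}
```

The views are only valid for as long as the buffer is, which is the usual trade-off with any view type.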
>
>>> This works by path_view inspecting the char literal passed to it for its
>>> type. It refuses to accept `char`, it will only accept byte [1],
>>> char8_t, char16_t and char32_t. In other words, you must explicitly
>>> specify the UTF encoding for path view input.
>> I personally don't feel it is necessary to refuse 'char', though I
>> understand the motivation. The inconsistency with std::filesystem::path
>> is more concerning to me than the use of 'char' to mean "bytes". This
>> is something worth polling in SG16.
> After lengthy discussion with Billy, I personally think that path ought
> to refuse char input in future C++ standards. Force people to specify
> what they actually mean. Tons of path-based code out there is subtly
> broken due to use of char, particularly on Windows, where char input to
> path doesn't at all mean what it means on POSIX.
>
> I should add that the killing off of char input was strongly requested
> by Billy. I got the feeling it was a red line for him. I can understand
> why, from a MSVC-implementer perspective, and I have witnessed first
> hand the brokenness of char input to path on Windows.

Since you mentioned Billy, I CC'd him. Perhaps he can join us in SG16
when this discussion comes up.

We'll talk more about this at the meeting, but here is something to think
about in the meantime. Consider my favorite not-all-platforms-are-like-that
operating system: z/OS. z/OS is EBCDIC based and supports a POSIX
conforming environment. Like other POSIX systems, file names are
sequences of bytes with (EBCDIC) '/' and '\0' reserved. In practice,
file names are encoded in some locale dependent EBCDIC code page. On
many POSIX systems, it may be expected that the 'char8_t' interface
would just pass those bytes straight through to the filesystem, but that
won't be true for z/OS. If a 'char' interface is not supported,
programmers will have to use the 'byte' interface if they want to avoid
the Unicode conversion costs and ensuing issues (errors due to
ill-formed names, errors due to unmappable characters and/or clashes due
to use of substitution characters, etc.). I suspect that most file
names use only characters from the basic source character set. Omitting
a 'char' interface thereby penalizes what is probably the most common
use case.

I do agree that programmers get file names wrong frequently. But the
'char' interfaces can be, and are often, used correctly and are the
dominant (portable / cross-platform) interface in use today.

>> Niall, assuming you will be able to attend SG16 in Cologne, please come
>> prepared with a specific list of questions/polls you would like feedback
>> on. I'll have a list as well (though I don't yet).
> I'm not sure if I have any, if I am honest. The SG16 feedback for R0 was
> valid, but I think it has all been addressed by R2.
Excellent, that is good to hear.
>
> I would welcome any comment, though please do read R2 first, as I think
> all the most likely comments are already answered there.
Of course, definitely.
>
>>> [1]: Raw bytes are accepted because most filesystems will actually
>>> accept raw binary bytes (minus a few reserved values such as '/' or 0)
>>> as path identifiers without issue. There is no interpretation done based
>>> on Unicode. Thus, not supporting raw bytes makes many valid filenames
>>> impossible to work with.
>> Agreed, though for some filesystems and filesystem APIs, it is even
>> worse (e.g., '\' (0x5C) can appear as a trailing code unit for Shift-JIS
>> file names for Windows ANSI file APIs and does not indicate a path
>> separator in that case).
> We must be careful to distinguish between the filesystem path
> translation layer which Win32 implements for backwards compatibility
> with Win16, and the actual filesystem path support.
>
> The actual filesystem path support on Windows is a bag of bits with no
> interpretation at all, unless you ask for case insensitive filename
> comparison (which Win32 does). Then it first tries memcmp(), and then
> falls back to locale-aware comparison. Newer NTFS lets you set the
> comparison method per directory, so some directories can be case
> insensitive, others case sensitive, or else the system default.
>
> Linux intends to add a similar layer-over-the-filesystem optional case
> insensitive path matching. See https://lwn.net/Articles/772960/.
>
> The point that I am making is that Unicode interpretation of filesystem
> paths is entirely outside our control. It is whatever the OS has decided
> it to be, and moreover it can *vary*, per directory. We supply an array
> of bytes; what those bytes mean is OS-determined. Similarly, when
> reading paths, they too are an array of bytes of unspecified meaning. We
> must be takers on that, and devise some non-crashing method of rendering
> such byte strings into readable text (and do better than Windows Explorer!)
>
> (Incidentally, the proper way to parse path components is to use
> llfio::path_handle, and ask it repeatedly for its parent path_handle.
> For each of those path_handle, you can ask them for their current_path()
> which is what the kernel says is the handle's current path, and from
> that determine which path fragments contain slashes and Shift-JIS bits
> etc. It's the only way, I am afraid, of reliably parsing path components)
I strongly agree with all of the above. Thanks for the lwn.net link, I
had not seen that.
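To make the Shift-JIS hazard I mentioned concrete, here is a small sketch (mine, and only an approximation). It assumes the usual Shift-JIS / code page 932 lead byte ranges of 0x81-0x9F and 0xE0-0xFC, and the function names are hypothetical. It also only works when scanning forward from the start of a well-formed string, which is part of why asking the kernel, as you describe, is the reliable route.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if `b` is the first byte of a two-byte Shift-JIS character
// (assumed ranges: 0x81-0x9F and 0xE0-0xFC).
constexpr bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: treats every 0x5C byte as a path separator.
inline std::size_t count_backslashes_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (static_cast<unsigned char>(c) == 0x5C) ++n;
    }
    return n;
}

// Lead-byte-aware scan: skips the trail byte of each two-byte character,
// so a 0x5C that is half of a character is not counted.
inline std::size_t count_separators_sjis(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b)) {
            ++i;  // skip the trail byte, whatever its value
            continue;
        }
        if (b == 0x5C) ++n;
    }
    return n;
}

static_assert(is_sjis_lead(0x95) && !is_sjis_lead(0x5C));
```

For a name containing the two-byte sequence 0x95 0x5C, the naive scan reports one separator more than is actually there.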
>
> This may explain why I don't have any questions at the current time for
> SG16.
Polls can be used for more than just feedback. You have considerable
experience that we might all benefit from. Asking for a poll for
something that you already "know" the answer to can help determine where
educational efforts may be warranted or to identify surprising
perspectives. At least some of what you are proposing will be
controversial (e.g., omitting 'char' based interfaces) and polling on
those features may be helpful to facilitate further discussion that will
improve chances for consensus later. Perhaps think of polls as more of
an avenue for influence than requests for (technical) answers and guidance.
> To the best of my knowledge, filesystem paths fall outside our
> power to do much about them. Please do correct me if I am wrong on this
> interpretation.

Strongly agreed. See 6.3 in P1238R1 :)

Tom.

>
> Niall

Received on 2019-07-03 22:35:46