sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Wed, 3 Jul 2019 22:38:55 +0100

>>> Note that the SG16 Unicode Direction paper was updated for the
>>> pre-meeting mailing with relevant discussion concerning encoding of file
>>> names. Please take a look at P1238R1
>>> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html)
>>> sections 2.7 and 6.3 and provide any feedback.
>>
>> It is mostly accurate. Some additional data points are that the NT
>> kernel permits both NUL and backslashes in filenames. So, if P1031 LLFIO
>> is in NT kernel path mode, you can literally dump random bytes into a
>> filename, and it just works (though it may crash Windows Explorer, and
>> nothing on Windows can then open or delete the file entry).
>
> Interesting, I wasn't aware of that behavior. Is it documented anywhere?

ReactOS source code :)

I confirmed it with a member of the Windows kernel Filesystem team. As
far as they are concerned, a path of:

c:\a\b\c\d\e

... which has path components of:

c:\a
b\c\d
e

... is absolutely legitimate and fine. You must remember that they do
not care for the Win32 layer one bit, and it's merely one of many NT
subsystems in any case.
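To illustrate why such a path cannot survive the Win32 layer, here is a
minimal sketch. It does not call any NT or LLFIO API; the helper names
join_components and split_on_backslash are hypothetical. It only shows
that once the three components are flattened into one string, a
separator-based parser can no longer recover them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not NT or LLFIO APIs.

// Join components with the Win32 separator.
std::string join_components(const std::vector<std::string>& parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += '\\';
        out += parts[i];
    }
    return out;
}

// Split on every backslash, as a separator-based parser would.
std::vector<std::string> split_on_backslash(const std::string& s) {
    std::vector<std::string> parts;
    std::string current;
    for (char ch : s) {
        if (ch == '\\') {
            parts.push_back(current);
            current.clear();
        } else {
            current += ch;
        }
    }
    parts.push_back(current);
    return parts;
}
```

Joining {"c:\a", "b\c\d", "e"} and splitting the result again yields six
components rather than the original three, so the information about
which backslashes were part of a name is lost.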

> I had a hard time finding good documentation for Apple's filesystems.

+1000. Apple are painful on this. I ended up having to trawl the sqlite
mailing list to yield Apple-specific filesystem quirks.

> Agreed, though #ifdef isn't necessarily required. The intent in
> defining std::filesystem::path::value_type and
> std::filesystem::path::string_type is to provide a generic mechanism for
> working with the different types. I acknowledge this is a leaky
> abstraction and, due to the lack of convenient conversion methods,
> cumbersome to use effectively.

In hindsight I think it was the wrong choice. Back during the original
Boost review of this, Peter Dimov very strongly argued for polymorphic
storage in path, so whatever the user supplied was what was stored. He
really went to bat arguing for that design instead. But consensus landed
on exposing the platform native type publicly, and doing conversion to
the native type upon path construction, so Beman chose that instead.
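As a rough illustration of the design that was chosen, the sketch below
uses only standard C++17 std::filesystem; the helper name
same_native_storage is hypothetical. Whatever character type the caller
supplies, it is converted on construction, and only the native form is
kept.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

// string_type is a basic_string of the platform's native value_type
// (wchar_t on Windows, char on POSIX systems).
static_assert(std::is_same_v<fs::path::string_type,
                             std::basic_string<fs::path::value_type>>);

// Hypothetical helper: true if two sources end up as the same stored
// native string. The character type originally supplied is not kept.
template <class A, class B>
bool same_native_storage(const A& a, const B& b) {
    return fs::path(a).native() == fs::path(b).native();
}
```

For an ASCII name, a std::string source and a std::wstring source should
compare equal here, because both are converted to the same native
representation. Under polymorphic storage, by contrast, each path would
have kept what it was given.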

(Do bear in mind that filesystem::path was one of the most contentious
and long-winded design debates of any Boost review, taking multiple
rounds of review. It was, quite frankly, *exhausting*.)

Having used path a lot since, I gotta say Peter was right, and the Boost
consensus got it wrong. That's also partially why path_view looks the
way it does. It's strong in all the areas where path is weak, and it's
an excellent complement to path in my opinion.

> We'll talk more about this at the meeting, but something to think about
> in the meantime. Consider my favorite not-all-platforms-are-like-that
> operating system: z/OS. z/OS is EBCDIC based and supports a POSIX
> conforming environment. Like other POSIX systems, file names are
> sequences of bytes with (EBCDIC) '/' and '\0' reserved. In practice,
> file names are encoded in some locale dependent EBCDIC code page. On
> many POSIX systems, it may be expected that the 'char8_t' interface
> would just pass those bytes straight through to the filesystem, but that
> won't be true for z/OS. If a 'char' interface is not supported,
> programmers will have to use the 'byte' interface if they want to avoid
> the Unicode conversion costs and ensuing issues (errors due to
> ill-formed names, errors due to unmappable characters and/or clashes due
> to use of substitution characters, etc...). I suspect that most file
> names use only characters from the basic source character set. Omitting
> a 'char' interface thereby penalizes what is probably the most common
> use case.

In my opinion, standard library implementers who care about high path
performance on z/OS are free to subclass path_view, and patch in char
input as a proprietary extension for that system only (I would
personally prefer an ebcdic_t character type, not char). End users on
z/OS can also subclass path_view, and implement a char constructor using
the byte constructor.

For path views, if you want to avoid memory copying and Unicode
translation, you are going to have to #ifdef for your specific platform.
If you don't #ifdef, you are explicitly accepting "I don't care about
path performance, I want convenience". Same applies to z/OS on this as
it does to Windows. I don't personally see any problem here.
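A minimal sketch of what that #ifdef might look like, using only
standard C++17. The names native_char, NATIVE_TEXT, native_view and
config_name are hypothetical and are not part of path_view or any
proposal.

```cpp
#include <filesystem>
#include <string_view>
#include <type_traits>

// Hypothetical names for illustration only.
#ifdef _WIN32
using native_char = wchar_t;   // Windows APIs take wchar_t
#define NATIVE_TEXT(s) L##s
#else
using native_char = char;      // POSIX APIs take raw bytes
#define NATIVE_TEXT(s) s
#endif

using native_view = std::basic_string_view<native_char>;

// The #ifdef above should agree with what std::filesystem::path stores.
static_assert(std::is_same_v<native_char,
                             std::filesystem::path::value_type>);

// A literal already in native form can be viewed without any copy or
// encoding conversion.
constexpr native_view config_name = NATIVE_TEXT("config.ini");
```

Code that skips the #ifdef and always passes char still works, but it
pays for a conversion wherever the native type is not char.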

> I do agree that programmers get file names wrong frequently. But the
> 'char' interfaces can be, and are often, used correctly and are the
> dominant (portable / cross-platform) interface in use today.

I don't think it's an overstatement to say that most of the char-based
filesystem::path code currently in the wild which appears to work on
both Windows and POSIX is broken. And that's an awful lot of code
affecting an awful lot of non-Western folk, who bear the brunt.
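One common source of that breakage, sketched under the assumption that
the program holds its file names as UTF-8: constructing a path from a
plain char string treats the bytes as the native narrow encoding, which
on Windows is the active code page rather than UTF-8. The helper name
path_from_utf8 is hypothetical.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from bytes known to be UTF-8.
// fs::path(std::string) would instead interpret the bytes in the native
// narrow encoding, which mangles non-ASCII names on Windows.
fs::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_char8_t)
    // C++20: char8_t sources are defined to be UTF-8.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path is the UTF-8 entry point (deprecated in C++20).
    return fs::u8path(utf8);
#endif
}
```

On POSIX systems both routes store the same bytes, which is why code
that only ever passes char strings appears to work until it meets a
non-ASCII name on Windows.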

In my opinion, we need to care a lot more about brokenness for 80% of
all code portable to the two major OSs than about inefficiency for the
1% of code which will be on an EBCDIC native system and is running C++23.

(If I were wearing my WG14 hat, I'd care a lot more about minor systems.
But for WG21, there are not a lot of minor systems which will ever run
C++23. Almost all future C++ will run on one of the big four OS kernels,
using one of the three major C++ compilers, on one of the two major CPU
architectures. As a former embedded systems developer, I'm more than
happy to support niche embedded devices, but the days of PIC are rapidly
giving way to low end ARM CPUs. It's becoming a real monotheism out
there. And while that's sad, it does enable us to optimise software
design accordingly.)

>> This may explain why I don't have any questions at the current time for
>> SG16.
>
> Polls can be used for more than just feedback. You have considerable
> experience that we might all benefit from. Asking for a poll for
> something that you already "know" the answer to can help determine where
> educational efforts may be warranted or to identify surprising
> perspectives. At least some of what you are proposing will be
> controversial (e.g., omitting 'char' based interfaces) and polling on
> those features may be helpful to facilitate further discussion that will
> improve chances for consensus later. Perhaps think of polls as more of
> an avenue for influence than requests for (technical) answers and guidance.

Oh for sure. I was very active on SG meeting attendance up until I was
no longer able to do so. You may note some of my WG21 papers have "SG14"
in their title: I built up a consensus there on a proposed design, and those
papers resulted. Unfortunately I can no longer attend US-timed meetings,
but it's not for a lack of recognition of the value of doing so.

Anyway, looking forward to the discussion at Cologne!

Niall

Received on 2019-07-03 23:39:01