sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 6 Jul 2019 17:36:32 -0400

On 7/3/19 5:38 PM, Niall Douglas wrote:
>>>> Note that the SG16 Unicode Direction paper was updated for the
>>>> pre-meeting mailing with relevant discussion concerning encoding of file
>>>> names. Please take a look at P1238R1
>>>> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html)
>>>> sections 2.7 and 6.3 and provide any feedback.
>>> It is mostly accurate. Some additional data points are that the NT
>>> kernel permits both NUL and backslashes in filenames. So, if P1031 LLFIO
>>> is in NT kernel path mode, you can literally dump random bytes into a
>>> filename, and it just works (though it may crash Windows Explorer, and
>>> nothing on Windows can either open or delete the file entry).
>> Interesting, I wasn't aware of that behavior. Is it documented anywhere?
> ReactOS source code :)
>
> I confirmed it with a member of the Windows kernel Filesystem team. As
> far as they are concerned, a path of:
>
> c:\a\b\c\d\e
>
> ... which is path components of:
>
> c:\a
> b\c\d
> e
>
> ... is absolutely legitimate and fine.
I suspect this is not an accurate characterization, as it gives the
impression that such names would not be discouraged. I doubt that any
Windows kernel developers would advocate use of file names that are not
supported by Win32.
> You must remember that they do
> not care for the Win32 layer one bit, and it's merely one of many NT
> subsystems in any case.
I find statements like this less than helpful in a technical
discussion. How (some of) the Windows kernel developers feel about the
design of the Win32 layer isn't relevant to how code is written in practice.
>
>> Agreed, though #ifdef isn't necessarily required. The intent in
>> defining std::filesystem::path::value_type and
>> std::filesystem::path::string_type is to provide a generic mechanism for
>> working with the different types. I acknowledge this is a leaky
>> abstraction and, due to the lack of convenient conversion methods,
>> cumbersome to use effectively.
> In hindsight I think it was the wrong choice. Back during the original
> Boost review of this, Peter Dimov very strongly argued for polymorphic
> storage in path, so whatever the user supplied was what was stored. He
> really went to bat arguing for that design instead. But consensus landed
> on exposing the platform native type publicly, and doing conversion to
> the native type upon path construction, so Beman chose that instead.
I don't think the std::filesystem::path interface excludes such an
implementation. Exposing the native type allows an application to be
tailored for that type so as to avoid conversions.
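
A minimal sketch of that generic style, assuming a C++17
std::filesystem implementation. The helper names are hypothetical and
exist only for illustration; the point is that the same source works
whether value_type is char or wchar_t, with no #ifdef and no conversion:

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

// string_type is specified as basic_string<value_type>; value_type is char
// on POSIX systems and wchar_t on Windows.
static_assert(std::is_same_v<fs::path::string_type,
                             std::basic_string<fs::path::value_type>>);

// Hypothetical helper: number of native code units, read without conversion.
std::size_t native_length(const fs::path& p) {
    return p.native().size();
}

// Hypothetical helper: true if the native string ends with the platform's
// preferred separator ('/' on POSIX systems, L'\\' on Windows).
bool ends_with_separator(const fs::path& p) {
    return !p.empty() && p.native().back() == fs::path::preferred_separator;
}
```

The cost is that any literal or buffer the application supplies still
has to be expressed in terms of value_type, which is where the
abstraction leaks.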
>> We'll talk more about this at the meeting, but something to think about
>> in the meantime. Consider my favorite not-all-platforms-are-like-that
>> operating system: z/OS. z/OS is EBCDIC based and supports a POSIX
>> conforming environment. Like other POSIX systems, file names are
>> sequences of bytes with (EBCDIC) '/' and '\0' reserved. In practice,
>> file names are encoded in some locale dependent EBCDIC code page. On
>> many POSIX systems, it may be expected that the 'char8_t' interface
>> would just pass those bytes straight through to the filesystem, but that
>> won't be true for z/OS. If a 'char' interface is not supported,
>> programmers will have to use the 'byte' interface if they want to avoid
>> the Unicode conversion costs and ensuing issues (errors due to
>> ill-formed names, errors due to unmappable characters and/or clashes due
>> to use of substitution characters, etc...). I suspect that most file
>> names use only characters from the basic source character set. Omitting
>> a 'char' interface thereby penalizes what is probably the most common
>> use case.
> In my opinion, standard library implementers who care about high path
> performance on z/OS are free to subclass path_view, and patch in char
> input as a proprietary extension for that system only (I would
> personally prefer an ebcdic_t character type, not char). End users on
> z/OS can also subclass path_view, and implement a char constructor using
> the byte constructor.
>
> For path views, if you want to avoid memory copying and unicode
> translation, you are going to have to #ifdef for your specific platform.
> If you don't #ifdef, you are explicitly accepting "I don't care about
> path performance, I want convenience". Same applies to z/OS on this as
> it does to Windows. I don't personally see any problem here.

I agree regarding the performance trade-offs, but that wasn't really my
point. I'm more concerned with code being simple and portable.
Excluding a 'char' based interface makes common use cases more
complicated than necessary.

Programmers do sometimes pass bad strings as file names to 'char' based
interfaces. I don't buy the argument that we should therefore exclude
such an interface any more than that we should deprecate references
because programmers sometimes use them after leaving them dangling.
There is nothing wrong on any platform with 'path_view("file")'. Yes,
'path_view("共同的秘密")' is not portable. I think it is fine to specify
that the interpretation of the path name is implementation defined as we
already do for std::filesystem::path in [fs.path.type.cvt]p2.1
(http://eel.is/c++draft/fs.class.path#fs.path.type.cvt-2.1).
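
As a rough illustration using std::filesystem::path (path_view is only
proposed, so it is not used here), the sketch below compares a narrow
name with a UTF-16 name after the implementation has converted both to
the native type. The helper name is hypothetical. For names restricted
to the basic source character set, such as "file", I would expect the
two to agree on the common implementations; for other names the result
depends on the implementation-defined narrow encoding:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: do a narrow (char) name and a UTF-16 name end up as
// the same native string once the implementation has converted each one?
// The narrow conversion is implementation defined, so this is only
// dependable for names that avoid characters outside the basic set.
bool same_native_name(const std::string& narrow, const std::u16string& utf16) {
    return fs::path(narrow).native() == fs::path(utf16).native();
}
```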

Note that the 'byte' interface is not portable either. NTFS is not a
byte oriented filesystem; file names are composed of 16-bit code units
(with few restrictions as we've discussed). What does it mean to pass a
sequence of bytes to path_view on Windows? Is each byte widened to 16
bits? Or are the bytes assumed to be little endian 16-bit sequences? Is
it then an error if an odd number of bytes is supplied? In my opinion,
the byte oriented interface is strictly worse than the char oriented
one. In either case, behavior is implementation defined.
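
The two readings can be made concrete with a small sketch. Both
functions are hypothetical illustrations and are not taken from
path_view or any proposal; they only show that the same bytes yield
different 16-bit names, and that the second reading has to decide what
to do with an odd byte count (here it reports failure):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Reading A (hypothetical): widen each byte to one 16-bit code unit.
std::u16string widen_each_byte(const std::vector<std::byte>& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (std::byte b : bytes)
        out.push_back(static_cast<char16_t>(std::to_integer<unsigned>(b)));
    return out;
}

// Reading B (hypothetical): treat each pair of bytes as one little endian
// 16-bit code unit. An odd number of bytes cannot be paired, so this
// sketch returns std::nullopt in that case.
std::optional<std::u16string> pair_as_le16(const std::vector<std::byte>& bytes) {
    if (bytes.size() % 2 != 0)
        return std::nullopt;
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned lo = std::to_integer<unsigned>(bytes[i]);
        const unsigned hi = std::to_integer<unsigned>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}
```

Given the four bytes 66 00 6F 00, reading A produces four code units
(two of them NUL) while reading B produces the two code units of "fo".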

>
>> I do agree that programmers get file names wrong frequently. But the
>> 'char' interfaces can be, and are often, used correctly and are the
>> dominant (portable / cross-platform) interface in use today.
> I don't think it's an overstatement to say that most of the char-based
> filesystem::path code currently in the wild which appears to work on
> both Windows and POSIX is broken. And that's an awful lot of code
> affecting an awful lot of non-Western folk, who bear the brunt.
I do think that is an overstatement unless you can provide evidence
otherwise.
>
> In my opinion, we need to care a lot more about brokenness for 80% of
> all code portable to the two major OSs rather than inefficiency for the
> 1% of code which will be on an EBCDIC native system and is running C++ 23.
I agree. Again, my argument was not based on performance.
>
> (If I were wearing my WG14 hat, I'd care a lot more about minor systems.
> But for WG21, there are not a lot of minor systems which will ever run
> C++ 23. Almost all future C++ will run on one of the big four OS kernels,
> using one of the three major C++ compilers, on one of the two major CPU
> architectures. As a former embedded systems developer, I'm more than
> happy to support niche embedded devices, but the days of PIC are rapidly
> giving way to low end ARM CPUs. It's becoming a real monoculture out
> there. And while that's sad, it does enable us to optimise software
> design accordingly)
This characterization is not accurate in my experience. At Coverity, we
see plenty of use of C++ in embedded systems, including modern dialects,
and I believe such usage is increasing. While I can't share details, we
are currently adding support for a complicated C++ language extension
for an embedded compiler (not related to Clang, gcc, Microsoft, or
Intel) with support for many architectures due to customer demand.
Ecosystems change. I think few would have predicted back around 2000
that Windows would have less than 12% market share on shipped devices in
2015, yet that is what has happened. And people have predicted the
death of the mainframe for my entire career. The market doesn't seem to
care about predictions or what is in vogue today.

Tom.

Received on 2019-07-06 23:36:36