Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path_view

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Sun, 7 Jul 2019 23:14:56 +0100
On 06/07/2019 22:36, Tom Honermann wrote:
>> I confirmed it with a member of the Windows kernel Filesystem team. As
>> far as they are concerned, a path of:
>>
>> c:\a\b\c\d\e
>>
>> ... which is path components of:
>>
>> c:\a
>> b\c\d
>> e
>>
>> ... is absolutely legitimate and fine.
>
> I suspect this is not an accurate characterization as it gives the
> impression that such names would not be discouraged. I doubt that any
> Windows kernel developers would advocate use of file names that are not
> supported by Win32.

It isn't portable, in any case. And there is always a good argument for
choosing to limit oneself voluntarily in the name of portability.

That said, if a filename of "b\c\d" is possible, then I personally think
that the correct software design should be able to handle it, though
perhaps not via the simplest front-facing APIs.

>> In hindsight I think it was the wrong choice. Back during the original
>> Boost review of this, Peter Dimov very strongly argued for polymorphic
>> storage in path, so whatever the user supplied was what was stored. He
>> really went to bat arguing for that design instead. But consensus landed
>> on exposing the platform native type publicly, and doing conversion to
>> the native type upon path construction, so Beman chose that instead.
>
> I don't think the std::filesystem::path interface excludes such an
> implementation. Exposing the native type allows an application to be
> tailored for that type so as to avoid conversions.

I would disagree with this interpretation. filesystem::path is very
clearly supposed to be implemented as if
std::basic_string<path::value_type>. That's what the Boost peer review
settled upon after considerable debate. That's how all the
implementations implement it.
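
As a rough illustration of why the public interface points that way, the
sketch below checks two properties of std::filesystem::path at compile
time: string_type is basic_string<value_type>, and native() returns a
reference to such a string. The helper native_length is a hypothetical
name used only for this example.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

namespace fs = std::filesystem;

// string_type is specified as basic_string<value_type>.
static_assert(std::is_same_v<fs::path::string_type,
                             std::basic_string<fs::path::value_type>>);

// native() returns a reference to a string of that type, so a path object
// has to be able to produce one without a conversion at the call site.
static_assert(
    std::is_same_v<decltype(std::declval<const fs::path&>().native()),
                   const fs::path::string_type&>);

// Hypothetical helper: number of native code units held by the path.
inline std::size_t native_length(const fs::path& p) {
    return p.native().size();
}
```

Returning a const reference from native() is the part that makes polymorphic
storage awkward to fit behind the existing interface.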

>> For path views, if you want to avoid memory copying and unicode
>> translation, you are going to have to #ifdef for your specific platform.
>> If you don't #ifdef, you are explicitly accepting "I don't care about
>> path performance, I want convenience". Same applies to z/OS on this as
>> it does to Windows. I don't personally see any problem here.
>
> I agree regarding the performance trade offs, but that wasn't really my
> point. I'm more concerned with code being simple and portable.
> Excluding a 'char' based interface makes common use cases more
> complicated than necessary.
>
> Programmers do sometimes pass bad strings as file names to 'char' based
> interfaces. I don't buy the argument that we should therefore exclude
> such an interface any more than that we should deprecate references
> because programmers sometimes use them after leaving them dangling.

I think that you and I have very different weighings of the risks
involved here.

> There is nothing wrong on any platform with 'path_view("file")'.

I am afraid that there is. A char filesystem path actually means "please
apply unspecified platform-, locale-, and user-setting-specific
transformations to this byte string during use".

In other words, char filesystem paths, ad extremis, mean "please permute
this with a random number generator into anything at all".

There is a simple solution: ban char paths. Make the user write what
they meant. Then behaviours are less surprising.

This said, the problem with non-byte string paths is that developers
would be right to think "if I say this is in UTF-8, then a UTF-8
comparison will occur". Yet this won't happen, because there is no
mechanism for user mode to tell the kernel that a specific byte string
is in UTF-8, or in any other encoding. So one must be very careful to
set very low expectations with respect to requiring specification of
path encoding.

> Yes,
> 'path_view("共同的秘密")' is not portable. I think it is fine to
> specify that the interpretation of the path name is implementation
> defined as we already do for std::filesystem::path in
> [fs.path.type.cvt]p2.1
> (http://eel.is/c++draft/fs.class.path#fs.path.type.cvt-2.1).

I think that is an unhelpful hand wave in the current standard. Moreover,
unlike all this pontificating over potential dangers, char input to
paths is a known footgun which blows up people's code. Char input to
paths is unwise to ever use, ever, on Microsoft Windows. Unfortunately
all that POSIX code makes it all too easy to do so.

> Note that the 'byte' interface is not portable either. NTFS is not a
> byte oriented filesystem; file names are composed of 16-bit code units
> (with few restrictions as we've discussed).

As we have already discussed, this is incorrect. NTFS works exclusively
using memcmp. It couldn't care less what your bytes are.

The IFS layer above may -- optionally -- retry the lookup after mapping
the entry in some way if memcmp fails to find a match.
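
To make the memcmp point concrete, here is a minimal sketch of a purely
byte-wise name comparison. It is not NTFS code; same_name_bytes and the
two sample names are hypothetical. It only shows that two spellings a
human would read as the same name are different byte strings.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: byte-wise equality of two names, with no Unicode
// normalization and no case folding.
inline bool same_name_bytes(const std::u16string& a, const std::u16string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size() * sizeof(char16_t)) == 0;
}

// The last letter as one precomposed code point (U+00E9) ...
inline const std::u16string precomposed = u"caf\u00E9";
// ... and as "e" followed by a combining acute accent (U+0301).
inline const std::u16string decomposed = u"cafe\u0301";
```

Under this comparison the two names above are distinct entries, and so are
names differing only in letter case, unless a layer above chooses to map
them.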

> What does it mean to pass a
> sequence of bytes to path_view on Windows? Is each byte widened to 16
> bit? Or are the bytes assumed to be little endian 16-bit sequences? Is
> it then an error if an odd number of bytes is supplied? In my opinion,
> the byte oriented interface is strictly worse than the char oriented
> one. In either case, behavior is implementation defined.

Almost exactly the opposite. The byte API is the only one which
guarantees that user mode code will do nothing to the byte string. It
passes it through, always unmodified.

What happens to that byte string once the kernel has it is up to the kernel.

>> I don't think it's an overstatement to say that most of the char-based
>> filesystem::path code currently in the wild which appears to work on
>> both Windows and POSIX is broken. And that's an awful lot of code
>> affecting an awful lot of non-Western folk, who bear the brunt.
>
> I do think that is an overstatement unless you can provide evidence
> otherwise.

char inputs to path mean wildly different things on POSIX and on
Windows. Code which appears to work actually only does so for 7-bit ASCII
path strings, and moreover those char strings cannot contain certain
special sequences of characters e.g. :, ;, CON, NUL, and so on.
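
As a small illustration of the 7-bit restriction, the sketch below tests
whether a char path stays within that range. The function name
is_seven_bit_ascii is hypothetical, and the check deliberately covers
only the byte range; it does not detect the special sequences mentioned
above.

```cpp
#include <string_view>

// Hypothetical helper: true if every byte is in the 7-bit range 0x00..0x7F.
inline bool is_seven_bit_ascii(std::string_view s) {
    for (unsigned char c : s) {
        if (c > 0x7F) return false;
    }
    return true;
}
```

A char path that fails this check is one whose meaning depends on how the
platform chooses to interpret the bytes.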

The wchar_t input to path on Windows is markedly better, but still not
commensurate with 16-bit input to path on POSIX. A "portable" code base
is still going to contain many silent gotchas in paths, especially valid
paths supplied by a program user, which silently break in weird and hard
to debug ways.

If we could wind the clock back, the correct thing that we ought to have
done is made all filesystem paths uninterpretable byte strings. But the
Filesystem TS already took 25 years to get into the standard, and
fighting people's incorrect belief that filesystem paths have anything
to do with human readable text would be a steep hill to climb.

(The same problem occurred with how inherently racy the filesystem is,
yet the Filesystem TS pretends it never changes. Beman told me that he
didn't have it in him to get Filesystem support into the standard
which supports race-free filesystem use. Too hard, he said, all for
ideological rather than engineering reasons. People believe the
filesystem cannot change concurrently with their own changes, and
educating enough of them otherwise to achieve consensus was probably
impossible in the 2000s. It *may* be easier now, thanks to POSIX.2008.)

So we are where we are, and we must move forwards rather than
looking backwards.

Niall

Received on 2019-07-08 00:15:00