sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 5 Sep 2019 13:53:54 -0400

On 9/5/19 12:17 PM, Niall Douglas wrote:
> On 05/09/2019 16:27, Tom Honermann wrote:
>> On 9/5/19 6:51 AM, Niall Douglas wrote:
>>> Firstly, NUL is a valid filesystem path codepoint on some platforms, and
>>> I'd like to get the standard fixed on that incorrectness in the near
>>> future. I think that we can reasonably declare the native path separator
>>> codepoint the only invalid filesystem path codepoint, as otherwise
>>> filesystem::path doesn't work.
>> Please don't try to fix this. I don't believe there is any use case
>> for supporting NUL characters within a path component and, clearly, C,
>> C++, POSIX, and Win32 APIs have *never* supported this, and existing
>> interfaces obviously cannot be updated to accommodate embedded NUL
>> characters. Supporting this would, with no motivation, effectively
>> break all code in existence that deals with file names.
> This is a question of correctness, not what is convenient.
It is also a question of what is useful.
>
> There is no reason why POSIX, or Win32, might not support NUL in
> filenames in the future. Especially if C introduces a string type that
> carries its length.
I disagree that there is no reason. The reason is that supporting this
requires entirely new interfaces. I don't see that happening.
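
To illustrate the interface problem, here is a minimal sketch (the
helper names are hypothetical, chosen only for this example). A name
with an embedded NUL is cut short as soon as it passes through a
NUL-terminated interface of the kind C and POSIX use today:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical example name: the five characters 'a', 'b', NUL, 'c', 'd'.
std::string make_name_with_embedded_nul() {
    return std::string("ab\0cd", 5);
}

// Hypothetical helper: the length a NUL-terminated interface would see.
// strlen() stops at the first NUL, so anything after it is lost.
std::size_t length_seen_by_c_api(const std::string& name) {
    return std::strlen(name.c_str());
}
```

The std::string holds all five characters, but any callee that takes a
const char* sees only the part before the embedded NUL. Carrying the
full name would need a pointer-plus-length style of interface instead.
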
>
>> I strongly disagree with the viewpoint that NTFS is a byte-based
>> filesystem. The fact that part of a filesystem-neutral interface is
>> weirdly designed (perhaps because it supports different filesystems,
>> some of which might actually be byte-based) does not mean that NTFS
>> doesn't store 16-bit code units.
> I never said that NTFS stores its filenames in bytes. The NTFS MFT uses
> a wchar_t count, so NTFS filenames are always a multiple of two bytes.
Correct, you didn't. My bad.
>
> But NTFS is not the only filesystem in Windows. And, as I actually
> originally said, the Windows filesystem API is byte-based. Non-NTFS
> filesystems could support filenames whose byte length is not a
> multiple of two, and the Windows filesystem API is just fine with that
> (though I would agree that the current Win32 layer would likely round
> down to the nearest multiple of two).

Right. And for this reason, I support a handle-based approach to path
handling, but I don't agree with the notion that the Windows filesystem
API is fundamentally byte-based just because the filesystem API has a
single entry point for communicating with a variety of filesystem
drivers. In my opinion, Win32 defines what are acceptable path names.
Perhaps this is where we disagree.

>
> Future modern i/o in C++ may support byte-multiple filenames on Windows.
> For example, UTF-8, with a direct unreencoded path between userspace and
> the filing system. No need to hardcode the assumption that all of
> Windows will always be wchar_t based forever if unnecessary. Indeed,
> given the ever closer integration of the Linux and Windows kernels,
> efficiency would demand that much more of the NT kernel works natively
> in UTF-8, which it is designed to do just fine.

I agree with this. This is why I support a handle-based approach rather
than a byte-based approach.

Tom.

Received on 2019-09-05 19:53:57