sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 5 Sep 2019 17:17:21 +0100

On 05/09/2019 16:27, Tom Honermann wrote:
> On 9/5/19 6:51 AM, Niall Douglas wrote:
>> Firstly, NUL is a valid filesystem path codepoint on some platforms, and
>> I'd like to get the standard fixed on that incorrectness in the near
>> future. I think that we can reasonably declare the native path separator
>> codepoint the only invalid filesystem path codepoint, as otherwise
>> filesystem::path doesn't work.
>
> Please don't try and fix this. I don't believe there is any use case
> for support of NUL characters within a path component and, clearly, C,
> C++, POSIX, and Win32 APIs have *never* supported this and existing
> interfaces obviously cannot be updated to accommodate embedded NUL
> characters. Supporting this effectively breaks all code in existence
> that deals with file names with no motivation.

This is a question of correctness, not what is convenient.

There is no reason why POSIX or Win32 might not support NUL in
filenames in the future, especially if C introduces a string type that
carries its length.
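
To illustrate the distinction, here is a minimal C++ sketch that assumes
only the standard library. A string that carries its length can hold an
embedded NUL, whereas an interface taking a NUL-terminated pointer sees
only the prefix before it. All helper names are hypothetical and exist
purely for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: build a 3-character name containing an embedded NUL.
// The explicit length argument is what preserves the bytes after the NUL.
std::string name_with_embedded_nul() { return std::string("a\0b", 3); }

// Length as seen by a string type that carries its own length.
std::size_t counted_length(const std::string& name) { return name.size(); }

// Length as seen by an interface that takes a NUL-terminated pointer:
// counting stops at the first NUL.
std::size_t c_api_length(const std::string& name) {
    return std::strlen(name.c_str());
}
```

The two lengths differ for such a name, which is why today's
NUL-terminated APIs cannot round-trip it.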

> I strongly disagree with the view point that NTFS is a byte based
> filesystem. The fact that part of a filesystem neutral interface is
> weirdly designed (perhaps because it supports different filesystems,
> some of which might actually be byte based) does not mean that NTFS
> doesn't store 16-bit code units.

I never said that NTFS stores its filenames in bytes. The NTFS MFT uses
a wchar_t count, so NTFS filenames are always a multiple of two bytes.

But NTFS is not the only filesystem in Windows. And, as I actually
originally said, the Windows filesystem API is byte-based. Non-NTFS
filesystems could support filenames whose byte length is not a multiple
of two, and the Windows filesystem API is just fine with that (though I
would agree that the current Win32 layer would likely round down to the
nearest multiple of two).
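
A rough sketch of what "byte-based" means here, under stated
assumptions: the struct below is a hypothetical, portable stand-in
loosely modelled on the NT UNICODE_STRING layout, whose Length field
counts bytes rather than characters. The rounding helper only
illustrates the "round down" behaviour speculated about above; it is not
a claim about what Win32 actually does.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a byte-counted name (not the real Windows type).
struct ByteCountedName {
    std::uint16_t length_bytes;   // length of the name in bytes
    const unsigned char* buffer;  // raw name bytes, not NUL-terminated
};

// True when the name can be viewed exactly as a sequence of 16-bit units.
constexpr bool is_multiple_of_two(const ByteCountedName& n) {
    return n.length_bytes % 2 == 0;
}

// Number of whole 16-bit code units a wchar_t-based layer could see.
// An odd trailing byte is dropped, so the length rounds down.
constexpr std::size_t whole_utf16_units(const ByteCountedName& n) {
    return n.length_bytes / 2;
}
```

A five-byte name therefore passes through a byte-counted interface
intact, but a layer that insists on 16-bit units would see only two of
them.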

Future modern I/O in C++ may support byte-multiple filenames on Windows,
for example UTF-8, with a direct path between userspace and the filing
system that involves no re-encoding. There is no need to hardcode the
assumption that all of Windows will be wchar_t based forever if that is
unnecessary. Indeed, given the ever closer integration of the Linux and
Windows kernels, efficiency would demand that much more of the NT kernel
works natively in UTF-8, which it is designed to do just fine.

Niall

Received on 2019-09-05 18:17:26