Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 6 Sep 2019 13:04:25 -0400
On 9/6/19 8:30 AM, Niall Douglas wrote:
>>> There is no reason why POSIX, or Win32, might not support NUL in
>>> filenames in the future. Especially if C introduces a lengthed string.
>> I disagree that there is no reason. The reason is that supporting this
>> requires all new interfaces. I don't see that happening.
>>
>> A forklift upgrade of the file system apis is not in the realm of
>> possibility, even if C provided a string type that allows embedded nuls.
>> Every program that processes paths is vulnerable to attack with
>> unexpected nuls. Even if POSIX provided APIs it would be fantastically
>> unlikely that vendors would allow their customers to be broken that way,
>> because the old APIs can't be turned off.
> POSIX already allows NUL to appear in paths returned by the OS. This is
> because POSIX code must be a *taker* when it comes to paths supplied by
> others e.g. by other systems, or filing systems, where NUL in path
> components is legal.
Can you please provide a link to some documentation of a filesystem that
allows NUL in path components? I don't mean an it-happens-to-work-on-X
reference, but documentation that states, yes, NUL characters are
supported. I'm sorry to ask, but I have sat in on some Austin Group
meetings and observed POSIX implementors complain about characters that
are problematic in path components (e.g., CR/LF) yet far less
problematic than NUL. Combined with the Austin Group's general
reluctance to standardize anything that isn't widespread existing
practice, I find it extremely unlikely that the Austin Group would
approve of NUL or '/' characters in path components. But I haven't sat
in on any Austin Group meetings in many years, so perhaps things have
changed.
>
> POSIX bans NUL appearing within paths supplied *to* the OS. And nobody
> is suggesting that the default would change in the future here, just
> that, say, open() when fed a string view might gain a flag
> O_BINARYPATH which means that the path supplied is some array of bits
> without character interpretation, so '/' and NUL can appear in it.
>
> Quite a few filing systems already implement this using proprietary
> APIs, because it's very useful. NTFS and ZFS come immediately to mind.
> The ability to support this from standard POSIX code is desirable, and I
> and others are trying to get that over the line, and into standards.
Ok, you mention NTFS and ZFS here, can you provide some links to their
documentation that describes this?
>
> Perhaps what you are not considering is that future storage devices will
> expose their internal key-value stores to the host? Some are already
> available on the market. We'd like to efficiently support those from
> standard code. They offer path-based lookup with performance orders of
> magnitude faster than existing path lookup. Such storage would be
> particularly suitable for build systems, which would create a "lookup
> realm", and fire objects into that realm each with a binary identifier.
> Entire realms can be efficiently cloned, or deleted. It's much faster
> than text-path-based filesystem build artefact stores, because you can
> avoid a kernel transition most of the time, userspace talks directly to
> the storage device.
I admit that this is something I have little exposure to.
>
> Getting back to the OP's original question, I repeat once again, they
> are best storing both the raw byte edition AND an attempted UTF-8
> conversion: try the raw byte array first and, if it is not found, try
> the UTF-8 edition converted to the local native filesystem encoding.
> It's the only sensible approach.

I think I agree here. The fallback to the UTF-8 encoded name may
succeed in some cases when the raw code units are passed to an
OS/library interface that mutates them before sending them down to the
filesystem (e.g., MSVC's _fopen()). However, this fallback should
probably be disabled if the producer knows that the UTF-8 representation
is not accurate (e.g., contains a substitution character).
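
To make the lookup order concrete, here is a minimal sketch of what a
consumer might do. Everything named here is hypothetical and not taken
from P1689: the InterchangeName record, the utf8_is_lossy flag, and the
caller-supplied exists and to_native callbacks are assumptions made
only for illustration.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical record for one file name in the interchange format.
struct InterchangeName {
    std::string raw_bytes;  // name exactly as the producer saw it
    std::string utf8;       // producer's best-effort UTF-8 conversion
    bool utf8_is_lossy;     // true if the conversion needed a substitution character
};

// Returns the name accepted by `exists`, or std::nullopt if neither
// candidate was found. `exists` probes the local filesystem and
// `to_native` converts UTF-8 to the local native filesystem encoding;
// both are supplied by the caller (assumed interfaces).
std::optional<std::string> resolve(
    const InterchangeName& name,
    const std::function<bool(const std::string&)>& exists,
    const std::function<std::string(const std::string&)>& to_native)
{
    // 1. Try the raw byte array first.
    if (exists(name.raw_bytes)) {
        return name.raw_bytes;
    }
    // 2. Skip the fallback if the producer flagged the UTF-8 form as inaccurate.
    if (name.utf8_is_lossy) {
        return std::nullopt;
    }
    // 3. Otherwise try the UTF-8 name converted to the native encoding.
    std::string native = to_native(name.utf8);
    if (exists(native)) {
        return native;
    }
    return std::nullopt;
}
```

Passing the filesystem probe and the conversion in as callbacks keeps
the sketch independent of any particular OS interface; a real consumer
would substitute whatever its platform provides.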

Tom.

>
> Niall
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-09-06 19:04:31