sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Fri, 6 Sep 2019 13:30:54 +0100

>> There is no reason why POSIX, or Win32, might not support NUL in
>> filenames in the future. Especially if C introduces a lengthed string.
>
> I disagree that there is no reason. The reason is that supporting this
> requires all new interfaces. I don't see that happening.
>
> A forklift upgrade of the file system apis is not in the realm of
> possibility, even if C provided a string type that allows embedded nuls.
> Every program that processes paths is vulnerable to attack with
> unexpected nuls. Even if POSIX provided APIs it would be fantastically
> unlikely that vendors would allow their customers to be broken that way,
> because the old APIs can't be turned off.

POSIX already allows NUL to appear in paths returned by the OS. This is
because POSIX code must be a *taker* when it comes to paths supplied by
others e.g. by other systems, or filing systems, where NUL in path
components is legal.

POSIX bans NUL appearing within paths supplied *to* the OS. Nobody is
suggesting that this default would change in the future, just that,
say, open() when fed a string view might gain a flag O_BINARYPATH,
meaning that the path supplied is some array of bits without character
interpretation, so '/' and NUL can appear in it.
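
To illustrate the current restriction, here is a minimal Python sketch.
The helper name try_open is hypothetical. Note that Python raises
ValueError before making the system call at all, since the underlying
open() takes a NUL-terminated C string and so cannot be handed an
embedded NUL. No O_BINARYPATH flag exists today, so none is used here.

```python
import os


def try_open(path: bytes) -> str:
    """Try to open `path` read-only and report what happened."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except ValueError:
        # Python refuses the call itself: a NUL-terminated C string
        # cannot carry an embedded NUL through to open().
        return "rejected: embedded NUL"
    except OSError as exc:
        # The path reached the OS, which reported an ordinary error.
        return "os error: " + exc.__class__.__name__
    os.close(fd)
    return "opened"
```

Calling try_open(b"foo\x00bar") takes the ValueError branch, whereas a
merely missing file is reported by the OS as an ordinary error.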

Quite a few filing systems already implement this using proprietary
APIs, because it's very useful. NTFS and ZFS come immediately to mind.
The ability to support this from standard POSIX code is desirable, and I
and others are trying to get that over the line, and into standards.

Perhaps what you are not considering is that future storage devices will
expose their internal key-value stores to the host? Some are already
available on the market. We'd like to efficiently support those from
standard code. They offer path-based lookup with performance orders of
magnitude faster than existing path lookup. Such storage would be
particularly suitable for build systems, which would create a "lookup
realm", and fire objects into that realm each with a binary identifier.
Entire realms can be efficiently cloned, or deleted. It's much faster
than text-path-based filesystem build artefact stores, because you can
avoid a kernel transition most of the time: userspace talks directly to
the storage device.

Getting back to the OP's original question, I repeat once again: they
are best off storing both the raw byte edition AND a UTF-8 attempt at
conversion. Try the raw byte array first; if it is not found, try the
UTF-8 edition converted to the local native filesystem encoding. It's
the only sensible approach.
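
As a rough illustration of that lookup order, here is a minimal Python
sketch. It assumes POSIX-style byte paths, and the names StoredName,
record and resolve are hypothetical, not part of P1689 or any library.
The text form is produced by strictly decoding the raw bytes with the
producer's filesystem encoding, and is simply omitted if that fails.

```python
import os
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredName:
    raw: bytes           # bytes exactly as the producing system gave them
    utf8: Optional[str]  # best-effort text form, None if conversion failed


def record(raw: bytes, encoding: Optional[str] = None) -> StoredName:
    """Store the raw bytes plus an attempted text conversion."""
    encoding = encoding or sys.getfilesystemencoding()
    try:
        text = raw.decode(encoding)  # strict: no lossy replacement
    except UnicodeDecodeError:
        text = None
    return StoredName(raw, text)


def _exists(path: bytes) -> bool:
    try:
        os.lstat(path)
        return True
    except (OSError, ValueError):  # ValueError covers embedded NUL
        return False


def resolve(name: StoredName, directory: bytes = b".") -> Optional[bytes]:
    """Try the raw bytes first, then the text form in the local encoding."""
    candidate = os.path.join(directory, name.raw)
    if _exists(candidate):
        return candidate
    if name.utf8 is not None:
        try:
            native = os.fsencode(name.utf8)  # local filesystem encoding
        except UnicodeEncodeError:
            return None
        candidate = os.path.join(directory, native)
        if _exists(candidate):
            return candidate
    return None
```

The consumer calls resolve() with the stored pair; a result of None
means neither form names an existing file on the local system.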

Niall

Received on 2019-09-06 14:30:58