sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 15:03:06 -0700

On Friday, 6 September 2019 14:09:44 PDT Tony V E wrote:
> > The encoding is only needed when converting raw bytes to text. Since
> > there's no conversion, the raw bytes are passed through unmodified from
> > payload to filesystem API and from filesystem API to the payload.
>
> If I know which API it was from, and have it available to me. And the
> filesystem encoding hasn't changed since then. Niall gives me the
> impression that can change. (Or is that only the display encoding that can
> change?)

The *native* API that you have access to. On Unix systems, that's the POSIX
API - open(), opendir(), readdir(), etc. On Windows, that's the Win32 API
(CreateFileW, FindNextFileW, etc.). I don't know if the Windows kernel API is
relevant.

The filesystem encoding never changes, since the bytes-on-disk that the FS
used to store the name don't. What changes is how you interpret those bytes.
And unfortunately, on Windows, the POSIX and C library APIs are emulation
layers, and their interpretation of those bytes can indeed change. That's why
I am saying that Windows applications must not use the C and C++ standard API.

std::filesystem muddies the waters a little bit because it can call the native
API on Windows and bypass the emulation layer. But having never used it (at
all, ever), I simply can't offer an opinion on whether it can be used or how
it can be safely used. Until someone provides an authoritative explanation,
the ISO C++ paper will have to say "don't use the ISO C++ API".

> I know you listed all the rules for many scenarios (on linux do..., on MS
> do...) but it seems a bit precarious to me. What happens when a new FS API
> comes around, or some other OS, EBCDIC, etc?

Fair question.

> How portable do we want/need these interchange files to be?

We need it to be portable to other applications running on the same OS and we
need a locale-independent method of transforming from the payload format to
the FS API. On Unix, that's the identity transform. For Windows, it's CESU-8
encoding of the 16-bit wchar_t string.
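
To make the Windows side concrete, here is a minimal sketch of that transform,
assuming the name is held as a sequence of 16-bit code units. The function
names encode_units and decode_units are hypothetical, not taken from any paper
or library. Each code unit is encoded on its own, so lone surrogates survive a
round trip. The decoder does not reject overlong forms; a real consumer might
want to.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode every 16-bit code unit independently (CESU-8 style). Surrogates,
// paired or lone, each become a 3-byte sequence, so no information is lost.
std::string encode_units(const std::u16string& name) {
    std::string out;
    for (char16_t u : name) {
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else if (u < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (u >> 12)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}

// Inverse transform: 1-, 2- and 3-byte sequences map back to one code unit
// each. 4-byte sequences never appear in this encoding and are rejected.
std::u16string decode_units(const std::string& bytes) {
    // Returns the 6 payload bits of the continuation byte at index k.
    auto cont = [&bytes](std::size_t k) -> unsigned {
        if (k >= bytes.size())
            throw std::invalid_argument("truncated sequence");
        const unsigned char b = static_cast<unsigned char>(bytes[k]);
        if ((b & 0xC0) != 0x80)
            throw std::invalid_argument("bad continuation byte");
        return b & 0x3Fu;
    };

    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0) {
            const unsigned c1 = cont(i + 1);
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | c1));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0) {
            const unsigned c1 = cont(i + 1);
            const unsigned c2 = cont(i + 2);
            out.push_back(
                static_cast<char16_t>(((b0 & 0x0F) << 12) | (c1 << 6) | c2));
            i += 3;
        } else {
            throw std::invalid_argument("invalid lead byte for CESU-8");
        }
    }
    return out;
}
```

On Windows the std::u16string would be filled from, and handed back to, the
wide-character API; that glue is left out here to keep the sketch portable.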

If you want a concession, here's one:

If the filename you obtained from the FS API was valid UTF of the width in
question (UTF-8 on Unix, UTF-16 on Windows), then store it as a text string.
Otherwise, store as a byte array. Note how this only affects the producer. The
consumer is still doing exactly what I outlined above: pass-through on Unix
and CESU-8 decoding on Windows.
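
For illustration only, the producer side of this concession on Unix could look
like the sketch below. The names PayloadKind, is_valid_utf8 and
choose_kind_unix are hypothetical. The check is a strict one (it rejects
overlong forms, surrogates and values above U+10FFFF), which is an assumption
on my part about what "valid" should mean here.

```cpp
#include <cstddef>
#include <string>

// How the producer would tag a file name in the payload.
enum class PayloadKind { TextString, ByteArray };

// Strict UTF-8 validation of the raw bytes obtained from the FS API.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {  // ASCII
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned long cp = 0;   // code point being assembled
        unsigned long min = 0;  // smallest value allowed for this length
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                     // overlong form
        if (cp > 0x10FFFF) return false;                // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}

// Producer-only decision: the consumer still passes the bytes through as-is.
PayloadKind choose_kind_unix(const std::string& raw_name) {
    return is_valid_utf8(raw_name) ? PayloadKind::TextString
                                   : PayloadKind::ByteArray;
}
```

A Latin-1 name such as "caf\xE9.h" ends up as a byte array, while its UTF-8
spelling "caf\xC3\xA9.h" ends up as a text string. In both cases the bytes
stored in the payload are exactly the bytes the FS API returned.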

I don't recommend this because the vast majority of file names *will* fall
into this concession, meaning that 99%+ of the SG15 payload files created will
use text strings. That means few tools will ever write the code for and test
the corner cases. We get the #pragma once problem: it usually doesn't fail,
but when it does, it's an unexpected failure, with little context, on the
machine of a single person who wasn't the one writing the code that failed.

PS: the CBOR encoding difference between a text string and the byte array
containing the UTF-8 encoding of that string is a single bit.
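
To show where that bit sits, here is a small sketch of a CBOR string encoder.
The names CborMajor and cbor_string are hypothetical. The major type occupies
the top three bits of the initial byte: 2 (binary 010) for a byte string and
3 (binary 011) for a text string, so the two encodings differ only in bit
0x20 of the first byte.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// CBOR major types for the two string kinds.
enum class CborMajor : std::uint8_t { ByteString = 2, TextString = 3 };

// Encode a definite-length CBOR string: an initial byte carrying the major
// type and the length (or a marker for a following big-endian length field),
// then the payload bytes unchanged.
std::vector<std::uint8_t> cbor_string(CborMajor major,
                                      const std::string& payload) {
    std::vector<std::uint8_t> out;
    const std::uint8_t mt =
        static_cast<std::uint8_t>(static_cast<unsigned>(major) << 5);
    const std::uint64_t n = payload.size();

    // Writes the initial byte followed by an nbytes-wide big-endian length.
    auto put_len = [&](std::uint8_t info, int nbytes) {
        out.push_back(static_cast<std::uint8_t>(mt | info));
        for (int shift = 8 * (nbytes - 1); shift >= 0; shift -= 8)
            out.push_back(static_cast<std::uint8_t>(n >> shift));
    };

    if (n < 24) {
        out.push_back(static_cast<std::uint8_t>(mt | n));  // length inline
    } else if (n <= 0xFF) {
        put_len(24, 1);
    } else if (n <= 0xFFFF) {
        put_len(25, 2);
    } else if (n <= 0xFFFFFFFFu) {
        put_len(26, 4);
    } else {
        put_len(27, 8);
    }

    for (char c : payload)
        out.push_back(static_cast<std::uint8_t>(c));
    return out;
}
```

For "a.cpp" the text string starts with 0x65 and the byte array with 0x45;
every byte after the first is identical.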

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-07 00:03:10