Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 12:52:45 -0700
On Friday, 6 September 2019 10:49:56 PDT Tony V E wrote:
> First of all
> It seems Option 2b is a superset of Option 2a, and is just more work for
> everyone, with no work saved. ie Windows still needs to support
> single-bytes, but can also use dual-bytes.
> Are we encouraging Windows tools to *only* use dual-bytes and not support
> single-bytes (ie not have full support)? What's the benefit of 2b?
> Can we narrow our choices by agreeing 2b isn't worthwhile?

Indeed, it's a superset that spreads the pain by making everyone have to
implement conversions, for the benefit of the case where a _WIN32 tool
produces a file that is read by another _WIN32 tool: then it can do pass-

> Now, overall, if I understand the discussion correctly:
> - if you encode the raw bytes (narrow or wide), you should add the encoding
> as well (ie "EBCDIC", etc).
> This implies every tool needs to support (and translate) every encoding, or
> accept that we will have non-interoperable tools, platform specific tools.
> Also, is the set of encodings finite, or can I add the "TONY" encoding?

There's no need to indicate which encoding was used, because both Option 2
variants encode the raw bytes exactly as they are used with the filesystem
API. The data is an opaque bag of bits.
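
To illustrate the "opaque bag of bits" point, here is a minimal sketch of a
byte-preserving round trip. The hex representation and the helper names
encode_raw and decode_raw are assumptions made for this example only; they are
not the format proposed in P1689. The point is that neither function needs to
know which encoding the bytes are in.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: hex-encode the raw bytes of a file name exactly as
// they were obtained from a narrow-character filesystem API.
std::string encode_raw(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0f]);
    }
    return out;
}

// Value of one lowercase hex digit, or -1 if the character is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode the hex form back into the original bytes.
// Returns false (leaving `out` untouched) if the input is malformed.
bool decode_raw(const std::string& hex, std::string& out) {
    if (hex.size() % 2 != 0) return false;
    std::string result;
    result.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const int hi = hex_value(hex[i]);
        const int lo = hex_value(hex[i + 1]);
        if (hi < 0 || lo < 0) return false;
        result.push_back(static_cast<char>((hi << 4) | lo));
    }
    out = result;
    return true;
}
```

A producer would call encode_raw on the name it got from the filesystem, and a
consumer would pass the result of decode_raw straight back to the filesystem
API. Text conversion only enters the picture when the name has to be shown to
a person.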

If you want to *display* that to the user, then converting to text is
necessary. But all the tools that display file names have such functionality,
since they already deal with file names obtained from the FS API.

> - if you encode the raw bytes, there might still be cases not covered,
> might need to fall back to UTF8. It sounds like *no* answer will be
> guaranteed to work.

In which case would the raw bytes fail where UTF-8 works? I would think it's
the other way around.
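
A small sketch of why it is the other way around, assuming a POSIX-style
filesystem where a file name may be any byte sequence that contains neither
'/' nor NUL. Every well-formed UTF-8 name is also a valid raw byte string, but
not every raw byte string is well-formed UTF-8. The function name
is_valid_utf8 is made up for this example; it is a plain validity check, not
part of any proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: no stray or missing continuation
// bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {  // ASCII, one byte
            ++i;
            continue;
        }
        std::size_t len = 0;      // total length of this sequence
        unsigned long cp = 0;     // decoded code point
        unsigned long min = 0;    // smallest code point allowed at this length
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min = 0x10000;
        } else {
            return false;  // continuation byte or invalid lead byte
        }
        if (n - i < len) return false;  // sequence is cut short
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}
```

For example, "café.cpp" stored with the Latin-1 byte 0xE9 for the accented
letter is a usable file name on such a system, yet is_valid_utf8 rejects it. A
UTF-8-only interchange format has no way to carry that name unchanged, while
the raw-bytes options can.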

> Are there systems where filenames *that developers use* can't be found via
> UTF8?

The problem is what happens when the locale isn't UTF-8, which is common
enough when LC_ALL=C was set in the environment.

But I repeat what I said: I am fine with Option 1 ("file names are text"),
knowing that there are failure modes. This has been the case for Qt for two
decades. We call those "filesystem corruption" and tell our users to go fix
them with a system tool.

Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-06 21:52:50