Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tony V E <tvaneerd_at_[hidden]>
Date: Fri, 6 Sep 2019 16:09:23 -0400
On Fri, Sep 6, 2019 at 3:52 PM Thiago Macieira <thiago_at_[hidden]> wrote:

> On Friday, 6 September 2019 10:49:56 PDT Tony V E wrote:
> > First of all
> >
> > It seems Option 2b is a superset of Option 2a, and is just more work for
> > everyone, with no work saved. ie Windows still needs to support
> > single-bytes, but can also use dual-bytes.
> > Are we encouraging Windows tools to *only* use dual-bytes and not support
> > single-bytes (ie not have full support)? What's the benefit of 2b?
> > Can we narrow our choices by agreeing 2b isn't worthwhile?
>
> Indeed, it's a superset that spreads the pain by making everyone have to
> implement conversions, for the benefit of the case where a _WIN32 tool
> produces a file that is read by another _WIN32 tool: then it can do
> pass-through.
>
> > Now, overall, if I understand the discussion correctly:
> >
> > - if you encode the raw bytes (narrow or wide), you should add the encoding
> > as well (ie "EBCDIC", etc).
> > This implies every tool needs to support (and translate) every encoding, or
> > accept that we will have non-interoperable tools, platform specific tools.
> > Also, is the set of encodings finite, or can I add the "TONY" encoding?
>
> There's no need to indicate which encoding was used because the Option 2
> variants encode the raw bytes that are used with the filesystem API. The data
> is an opaque bag of bits.
>

But it is only valid if you use those bits with the same API and encoding
that they came from (if you don't know the encoding).
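
To make that concrete, here is a small Python sketch. It is not from P1689;
the file name and the three encodings are assumptions picked for illustration.
It shows the same "opaque" bytes being read by consumers that assume different
encodings or a different API width.

```python
# Illustration only: the name and encodings below are assumptions.
name = "caf\u00e9.cpp"  # "café.cpp", written with an escape to stay ASCII-safe

# Producer runs with a Latin-1 narrow encoding and records the raw bytes.
raw = name.encode("latin-1")  # b'caf\xe9.cpp'

# Consumer A assumes the same encoding: the name round-trips.
same = raw.decode("latin-1")

# Consumer B assumes UTF-8: the byte 0xE9 followed by '.' is not valid UTF-8.
try:
    other = raw.decode("utf-8")
except UnicodeDecodeError:
    other = None

# Consumer C is handed bytes from a wide (UTF-16LE) API but treats them as a
# narrow string: every other character becomes NUL.
wide = name.encode("utf-16-le")
misread = wide.decode("latin-1")
```

Only consumer A, which shares the producer's API and encoding, recovers the
original name. The bytes alone carry no hint about which reading is right.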


> If you want to *display* that to the user, then converting to text is
> necessary. But all the tools that display file names have such functionality,
> since they already deal with file names obtained from the FS API.
>
> > - if you encode the raw bytes, there might still be cases not covered,
> > might need to fall back to UTF8. It sounds like *no* answer will be
> > guaranteed to work.
>
> Which case could there be that the raw bytes fail but UTF-8 supports? I would
> think it's the other way around.
>

The case where the encoding changed, or where the raw bytes are being used with
the wrong FS API.
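
A sketch of the "encoding changed" case, in Python. Everything here is
hypothetical: the dict stands in for a consumer-side filesystem whose names
are stored as UTF-8 bytes, and `open_raw` / `open_text` are made-up helpers,
not any real tool's API.

```python
# Illustration only: names, helpers and encodings are assumptions.
name = "caf\u00e9.cpp"  # "café.cpp"

# Hypothetical consumer machine: file names are stored as UTF-8 bytes.
consumer_fs = {name.encode("utf-8"): "source text"}

# Raw-bytes interchange: the producer ran under a Latin-1 locale.
recorded_raw = name.encode("latin-1")

# Text interchange: the producer recorded the name as text (sent as UTF-8).
recorded_text = name.encode("utf-8").decode("utf-8")


def open_raw(fs, raw):
    """Look the file up using the recorded bytes unchanged."""
    return fs.get(raw)


def open_text(fs, text, fs_encoding):
    """Re-encode the text name into the consumer's filesystem encoding."""
    return fs.get(text.encode(fs_encoding))


raw_result = open_raw(consumer_fs, recorded_raw)
text_result = open_text(consumer_fs, recorded_text, "utf-8")
```

Here `raw_result` is `None` (the Latin-1 bytes do not match the stored UTF-8
name) while `text_result` finds the file, because the text form can be
re-encoded for whatever the consumer's filesystem uses.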


> > Are there systems where filenames *that developers use* can't be found via
> > UTF8?
>
> The problem is what happens when the locale isn't UTF-8, which is common
> enough when LC_ALL=C was set in the environment.
>
>
And how common is that (besides you :-)
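
For reference, a sketch of what a non-UTF-8 locale does to such a name. This
uses Python's `surrogateescape` error handler (PEP 383) purely as an
illustration of one tool's workaround; it is not something proposed in this
thread, and the ASCII codec below is my stand-in for an LC_ALL=C locale.

```python
# Illustration only: "ascii" stands in for the encoding of an LC_ALL=C locale.
raw = "caf\u00e9.cpp".encode("utf-8")  # b'caf\xc3\xa9.cpp' as stored on disk

# A strict decode in the C locale fails on the non-ASCII bytes.
try:
    raw.decode("ascii")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# surrogateescape smuggles the undecodable bytes through as lone surrogates,
# so the original bytes can be recovered exactly.
smuggled = raw.decode("ascii", errors="surrogateescape")
round_trip = smuggled.encode("ascii", errors="surrogateescape")

# The smuggled string is not valid text: strict UTF-8 refuses lone surrogates,
# so it cannot be written into a UTF-8 interchange file as-is.
try:
    smuggled.encode("utf-8")
    interchangeable = True
except UnicodeEncodeError:
    interchangeable = False
```

The bytes survive locally (`round_trip == raw`), but the in-memory "text" is
not something another tool could consume as UTF-8.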


> But I repeat what I said: I am fine with Option 1 ("file names are text"),
> knowing that there are failure modes. This has been the case for Qt for two
> decades. We call those "filesystem corruption" and tell our users to go fix
> with a system tool.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel System Software Products
>

-- 
Be seeing you,
Tony

Received on 2019-09-06 22:09:41