Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tony V E <tvaneerd_at_[hidden]>
Date: Fri, 6 Sep 2019 13:49:56 -0400
First of all,

It seems Option 2b is a superset of Option 2a, and is just more work for
everyone, with no work saved. I.e. Windows still needs to support
single-bytes, but can also use dual-bytes.
Are we encouraging Windows tools to *only* use dual-bytes and not support
single-bytes (i.e. not have full support)? What's the benefit of 2b?
Can we narrow our choices by agreeing 2b isn't worthwhile?

Now, overall, if I understand the discussion correctly:

- if you encode the raw bytes (narrow or wide), you should add the encoding
as well (e.g. "EBCDIC", etc.).
This implies every tool needs to support (and translate) every encoding, or
accept that we will have non-interoperable, platform-specific tools.
Also, is the set of encodings finite, or can I add the "TONY" encoding?

- if you encode the raw bytes, there might still be cases not covered, and
we might need to fall back to UTF8. It sounds like *no* answer will be
guaranteed to work.

So let's go with UTF8, and tell tools not to spit out files that can't be
found via UTF8. How many of the tools we currently use already have those
limitations?

Lastly,

I think, since C++ is a "systems" language, there may be value in APIs that
expose the full range of filenames that the OS can handle. But that's a
separate discussion, I think.
The filenames for tool interchange don't need to support everything. They
only need to support what is actually used.

Are there systems where filenames *that developers use* can't be found via
UTF8?

P.S. I think that if we say UTF8, all the tools will fall into line, and
warn users if they ever encounter a filename that can't be found via UTF8.
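
To make "can't be found via UTF8" concrete, here is a rough sketch of the kind
of check a tool could run before writing a filename out for interchange. It is
only an illustration: the function names are made up for this example, nothing
here comes from P1689, and it assumes narrow names are raw bytes and wide names
are sequences of 16-bit code units.

```python
from typing import Sequence


def utf8_round_trips(raw: bytes) -> bool:
    """True if a narrow (byte) filename is valid UTF-8.

    Strict decoding rejects stray bytes such as a Latin-1 0xE9, so a name
    that passes can be written as UTF-8 and recovered byte-for-byte.
    """
    try:
        return raw.decode("utf-8").encode("utf-8") == raw
    except UnicodeDecodeError:
        return False


def utf16_round_trips(units: Sequence[int]) -> bool:
    """True if a wide filename (16-bit code units) is well-formed UTF-16.

    An unpaired surrogate has no UTF-8 representation, so such a name
    cannot survive a trip through UTF-8.
    """
    packed = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        packed.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


def warn_if_not_interchangeable(raw: bytes) -> None:
    """Hypothetical tool behaviour: warn instead of emitting a bad name."""
    if not utf8_round_trips(raw):
        print(f"warning: filename {raw!r} cannot be represented in UTF-8")
```

A tool that only ever sees names passing these checks loses nothing by
emitting UTF8; the open question is how often real projects contain names
that fail them.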


On Fri, Sep 6, 2019 at 1:23 PM Thiago Macieira <thiago_at_[hidden]> wrote:

> On Friday, 6 September 2019 06:38:45 PDT Brad King wrote:
> > - UTF-8. This is allowed *only if a lossless round trip* is possible
> > between the filesystem's native binary sequence and UTF-8. E.g. on
> > Windows we should not have to require the full general format to represent
> > a simple path like "a.cxx" just because the filesystem APIs use wide chars.
>
> Hello Brad
>
> The problem is that the filesystem's native binary sequence is unspecified
> and can fail to match between programs running at the same time as well as
> different invocations of the same program. So your requirement that it be
> lossless is insufficient to ensure reproducibility.
>
> So I repeat what I said to Niall: choose one only. If you allow the Unicode
> text to be authoritative under any scenario, that means you're allowing
> failures to occur. In that case, I recommend choosing Option 1 and using
> *only* Unicode text and "damn the torpedoes".
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel System Software Products


-- 
Be seeing you,
Tony

Received on 2019-09-06 19:50:14