sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Sat, 7 Sep 2019 10:07:33 +0100

On 07/09/2019 02:07, Thiago Macieira wrote:
> On Friday, 6 September 2019 16:33:03 PDT Niall Douglas wrote:
>>> I'm interpreting this in two cases:
>>> 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be
>>> decoded using UTF-8
>>> 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
>>> which means I can encode it to 8-bit with UTF-8
>>
>> You're excluding ANSI on Windows.
>
> Yes, intentionally.
>
>> I keep bringing it up, because:
>>
>> int main(int argc, char *argv[])
>> {
>>   std::filesystem::path(argv[1]);
>>   ...
>>
>> ... involves a conversion of the system narrow encoding, which is locale
>> dependent, to the filesystem native encoding, which on Windows is
>> currently incorrectly defined by the standard to only ever be UTF-16
>> wchar_t. This is still the case even when _UNICODE is defined. And there
>> is a ton of build tooling out there which works with char arrays,
>> including on Windows.
>
> The mistake was to use argv. If you're on Windows and you want to deal with
> proper file names on the command-line, call GetCommandLineW and get the actual
> command-line.

No, no, no.

We are getting within a cat's whisker of UTF-8 being the default narrow
encoding on Windows for new Visual Studio projects, i.e. all the ANSI and
char APIs on Windows would default to speaking UTF-8 in new or upgraded
code. What you propose would ruin that effort.

You may not be aware, but after a discussion with some Microsoft folk, I
went ahead and submitted that as a feature request for the next major
release of Visual Studio. And from what I am told, they are seriously
considering it. The Windows console supports it, Windows supports it,
and the MSVCRT runtime supports it when asked. All someone needs to do
is flip the switch for new projects targeting the latest Windows 10
only, and we're done.

Having char = utf8 across all the major platforms would be an *enormous*
win. Please aid that effort.

(If the P1689 authors wish to insist that char = utf8 in tooling, I would
applaud such a bold stance.)
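
For the Windows case quoted above (16-bit words decoded as UTF-16, then
re-encoded to 8-bit as UTF-8), here is a minimal sketch of the
transcoding step. The names append_utf8 and utf16_to_utf8 are
hypothetical, not taken from P1689 or LLFIO, and std::u16string stands in
for the 16-bit wchar_t string Windows would hand back. It assumes
interchange should reject unpaired surrogates rather than guess, which
matters because a Windows filename is not guaranteed to be valid UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string &out, char32_t cp)
{
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical helper: transcode UTF-16 to UTF-8.
// Returns std::nullopt if the input contains an unpaired surrogate,
// i.e. it is a "bag of 16-bit words" that is not valid UTF-16.
inline std::optional<std::string> utf16_to_utf8(const std::u16string &in)
{
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= in.size()) return std::nullopt;
      const char32_t lo = in[i + 1];
      if (lo < 0xDC00 || lo > 0xDFFF) return std::nullopt;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
      ++i;  // consumed the low surrogate as well
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return std::nullopt;
    }
    append_utf8(out, cp);
  }
  return out;
}
```

A tool could call utf16_to_utf8 on each filename before writing it to the
interchange file, and treat std::nullopt as "cannot be represented".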

>> I have not currently decided what LLFIO will do on this. I really hate
>> the ANSI APIs. But Billy O'Neal gave me a very convincing motivating
>> use case:
>>
>> int main(int argc, char *argv[])
>> {
>>   auto fh = file({}, argv[1]);
>>
>> If LLFIO calls the ANSI API here, this "just works" even on Shift-JIS
>> and all the other weird legacy encodings Windows supports.
>>
>> I still haven't brought myself to implement the support, though.
>
> Convert from ANSI on creation.
>
> If that makes it impossible to have an allocation-free class, then an
> allocation-free class is impossible.

All the ANSI Windows APIs always dynamically allocate memory in any case.

But it's not as easy as a simple convert from ANSI on creation. Some of
the ANSI APIs may not behave the same as the Unicode editions when fed
an ANSI to Unicode converted path. I have yet to do an audit, i.e. run
through the ReactOS source code checking for surprises. I have a
userbase who don't want to see breakage.
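
For reference, the "simple convert from ANSI on creation" step on its own
would look roughly like the sketch below. narrow_to_wide is a
hypothetical name, not an LLFIO API. The Windows branch uses
MultiByteToWideChar with CP_ACP (the current ANSI code page); the
non-Windows branch is only a stand-in using std::mbstowcs so that the
sketch compiles elsewhere. It does nothing about the behavioural
differences between the ANSI and Unicode APIs mentioned above.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a string in the system narrow encoding
// to a wide string. Returns std::nullopt if the input cannot be decoded.
inline std::optional<std::wstring> narrow_to_wide(const std::string &in)
{
  // MultiByteToWideChar fails on zero-length input, so handle it up front.
  if (in.empty()) return std::wstring();
#ifdef _WIN32
  assert(in.size() <= static_cast<std::size_t>(INT_MAX));
  const int len = static_cast<int>(in.size());
  // First call: ask how many wide characters are needed.
  int n = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                              in.data(), len, nullptr, 0);
  if (n <= 0) return std::nullopt;
  std::wstring out(static_cast<std::size_t>(n), L'\0');
  // Second call: perform the conversion into the sized buffer.
  n = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                          in.data(), len, &out[0], n);
  if (n <= 0) return std::nullopt;
  return out;
#else
  // Stand-in for other platforms: decode using the current C locale.
  // The result never has more wide characters than the input has bytes.
  std::wstring out(in.size(), L'\0');
  const std::size_t n = std::mbstowcs(&out[0], in.c_str(), out.size());
  if (n == static_cast<std::size_t>(-1)) return std::nullopt;
  out.resize(n);
  return out;
#endif
}
```

Note that this allocates, which is the trade-off raised in the quoted
text about an allocation-free class.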

Ultimately it comes down to free time. I get very little of it outside
of work, so progress is exceptionally slow. Most of my recent weeks have
been consumed with deterministic exceptions, for Ben Craig's execution
time overhead paper, and a consensus WG14 and WG21 proposal.

Niall

Received on 2019-09-07 11:07:39