Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 18:07:48 -0700
On Friday, 6 September 2019 16:33:03 PDT Niall Douglas wrote:
> > I'm interpreting this in two cases:
> > 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be
> > decoded using UTF-8
> > 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
> > which means I can encode it to 8-bit with UTF-8
>
> You're excluding ANSI on Windows.

Yes, intentionally.

> I keep bringing it up, because:
>
> int main(int argc, char *argv[])
> {
>     std::filesystem::path(argv[1]);
>     ...
>
> ... involves a conversion of the system narrow encoding, which is locale
> dependent, to the filesystem native encoding, which on Windows is
> currently incorrectly defined by the standard to only ever be UTF-16
> wchar_t. This is still the case even when _UNICODE is defined. And there
> is a ton of build tooling out there which works with char arrays,
> including on Windows.

The mistake was to use argv. If you're on Windows and you want to deal with
proper file names on the command-line, call GetCommandLineW and get the actual
command-line.
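
As a rough sketch of that advice: on Windows, fetch the wide command line with
GetCommandLineW, split it with CommandLineToArgvW, and re-encode each argument
as UTF-8. The helper names utf16_to_utf8 and utf8_arguments are made up for
this illustration, and replacing unpaired surrogates with U+FFFD is an
assumption of the sketch, not something the thread settled. On other platforms
argv is passed through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
#  include <windows.h>
#  include <shellapi.h>   // CommandLineToArgvW; link with Shell32.lib
#endif

// Hypothetical helper: encode UTF-16 as UTF-8.
// Assumption: unpaired surrogates become U+FFFD instead of raising an error.
std::string utf16_to_utf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: return the program arguments as UTF-8 strings.
std::vector<std::string> utf8_arguments(int argc, char **argv)
{
    std::vector<std::string> args;
#ifdef _WIN32
    (void)argc; (void)argv;  // the ANSI-converted argv is deliberately ignored
    int wargc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv) {
        for (int i = 0; i < wargc; ++i) {
            std::wstring w(wargv[i]);  // wchar_t is 16 bits wide on Windows
            args.push_back(utf16_to_utf8(std::u16string(w.begin(), w.end())));
        }
        LocalFree(wargv);
    }
#else
    assert(argc >= 0 && argv != nullptr);
    args.assign(argv, argv + argc);  // bytes are passed through as-is
#endif
    return args;
}
```

A main() would call utf8_arguments(argc, argv) once at startup and use only
the returned strings afterwards.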

> It's all well and good for Thiago etc to say "you must use wmain()". I
> think P1689 must be a taker when it comes to persuading existing build
> tooling to use their interchange format. If they're using char arrays,
> if they're using main() not wmain(), you need to support that.

Indeed, the proposal for Option 2 is specifically that if _WIN32 is defined,
you must use the W API. The ANSI API is banned, including argv and fopen.

Interestingly, Cygwin/MSYS2 and WSL have shown that it's possible to fix this
on Windows. It requires no kernel modification, just a different C runtime. I
don't claim it's easy, only that there is a solution. (It needs to be coupled
with deprecating and banning the ANSI API.)

> Otherwise they're either going to corrupt your JSON on non-US locales,
> which upsets developers. Or they're going to extend your JSON to have
> been correct in the first place. Or they're going to use their own
> interchange format, and say in the docs "don't use the standard JSON
> format, it's broken".

If you corrupt the JSON file, then your JSON encoder is broken in the first
place or you misused the API.

void json_add_string(JsonWriter *, const char *utf8String);

If you pass non-UTF-8 there, you made a mistake. It's a bug in your code. Use
mbsrntoc8s().
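
To illustrate the contract, here is a minimal sketch that treats non-UTF-8
input as a caller bug. JsonWriter is a toy stand-in and json_add_string_checked
is a made-up wrapper; neither comes from a real library. The validator rejects
overlong forms, surrogates and values above U+10FFFF. It only detects bad
input; converting from the locale encoding is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, min;
        if (b < 0x80) { ++i; continue; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                      // stray continuation or invalid lead byte
        if (s.size() - i < len) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                       // overlong, out of range, or surrogate
        i += len;
    }
    return true;
}

// Toy stand-in for a JSON writer (assumption, for illustration only).
struct JsonWriter { std::string out; };

// Precondition: utf8String is valid UTF-8. Violating it is a caller bug.
void json_add_string(JsonWriter *w, const char *utf8String)
{
    assert(is_valid_utf8(utf8String));
    w->out += '"';
    for (const char *p = utf8String; *p; ++p) {
        unsigned char c = static_cast<unsigned char>(*p);
        if (c == '"' || c == '\\') {
            w->out += '\\';
            w->out += *p;
        } else if (c < 0x20) {
            char buf[7];
            std::snprintf(buf, sizeof buf, "\\u%04x", c);  // escape control characters
            w->out += buf;
        } else {
            w->out += *p;
        }
    }
    w->out += '"';
}

// Hypothetical wrapper for callers that cannot guarantee the precondition:
// returns false and writes nothing when the input is not valid UTF-8.
bool json_add_string_checked(JsonWriter *w, const std::string &s)
{
    if (!is_valid_utf8(s))
        return false;
    json_add_string(w, s.c_str());
    return true;
}
```

With this shape, a Shift-JIS or Latin-1 byte string is refused before it ever
reaches the output, so the JSON file itself cannot be corrupted.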

> I have not currently decided what LLFIO will do on this. I really hate
> the ANSI APIs. But Billy O'Neal gave me a very convincing motivating
> use case:
>
> int main(int argc, char *argv[])
> {
>     auto fh = file({}, argv[1]);
>
> If LLFIO calls the ANSI API here, this "just works" even on Shift-JIS
> and all the other weird legacy encodings Windows supports.
>
> I still haven't brought myself to implement the support, though.

Convert from ANSI on creation.

If that makes it impossible to have an allocation-free class, then an
allocation-free class is impossible.
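
A minimal sketch of "convert on creation", assuming a made-up native_path
class (this is not LLFIO's API): the narrow string is converted to the wide
form once in the constructor, and the object only ever holds the wide form.
On Windows it uses MultiByteToWideChar with CP_ACP. The non-Windows branch
uses std::mbsrtowcs with the current C locale and is there only so the sketch
compiles everywhere. The conversion allocates, which is the trade-off
described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <utility>

#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: convert a narrow (ANSI / locale-encoded) string to wide.
std::wstring narrow_to_native(const char *s)
{
    assert(s != nullptr);
#ifdef _WIN32
    int n = static_cast<int>(std::strlen(s));
    if (n == 0)
        return std::wstring();
    // First call computes the required length; invalid input is reported as failure.
    int wn = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, s, n, nullptr, 0);
    if (wn <= 0)
        throw std::runtime_error("argument is not valid in the ANSI code page");
    std::wstring w(static_cast<std::size_t>(wn), L'\0');
    MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, s, n, &w[0], wn);
    return w;
#else
    std::mbstate_t st = std::mbstate_t();
    const char *p = s;
    std::size_t wn = std::mbsrtowcs(nullptr, &p, 0, &st);  // length query only
    if (wn == static_cast<std::size_t>(-1))
        throw std::runtime_error("argument is not valid in the locale encoding");
    std::wstring w(wn, L'\0');
    st = std::mbstate_t();
    p = s;
    std::mbsrtowcs(&w[0], &p, wn, &st);
    return w;
#endif
}

// Hypothetical path type: conversion happens exactly once, at construction.
class native_path {
public:
    explicit native_path(const char *ansi) : text_(narrow_to_native(ansi)) {}
    explicit native_path(std::wstring wide) : text_(std::move(wide)) {}
    const std::wstring &native() const { return text_; }
private:
    std::wstring text_;
};
```

Code that receives a native_path can then call only the W API, whichever
constructor was used to build it.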

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-07 03:07:53