Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Sat, 7 Sep 2019 00:33:03 +0100
> I'm interpreting this in two cases:
> 1) on Unix, the bag of 8-bit bytes obtained from the FS API can be
> decoded using UTF-8
> 2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
> which means I can encode it to 8-bit with UTF-8

You're excluding ANSI on Windows. I keep bringing it up, because:

int main(int argc, char *argv[])
{
  std::filesystem::path(argv[1]);
  ...

... involves a conversion from the system narrow encoding, which is
locale-dependent, to the filesystem native encoding, which on Windows is
currently incorrectly defined by the standard to only ever be UTF-16
wchar_t. This is still the case even when _UNICODE is defined. And there
is a ton of build tooling out there which works with char arrays,
including on Windows.

> Niall's reply gave me the impression that even with this restriction,
> there would still be problems. Thus my scenario.

I remind the list that I wanted to drop char input from path_view, but
SG16 feedback put it back in.

This kind of crap is why I wanted char input left out.

It's all well and good for Thiago etc. to say "you must use wmain()". I
think P1689 must be a taker when it comes to persuading existing build
tooling to use their interchange format. If they're using char arrays,
if they're using main() not wmain(), you need to support that.

Otherwise they're going to do one of three things: corrupt your JSON on
non-US locales, which upsets developers; extend your JSON into what it
should have been in the first place; or use their own interchange
format, and say in the docs "don't use the standard JSON format, it's
broken".
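
To make the corruption risk concrete, here is a minimal sketch of a check
a tool could run on a char path before writing it into the JSON, assuming
the interchange file is meant to be UTF-8. The name is_valid_utf8 is
hypothetical and is not taken from P1689 or from any existing build tool.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong
// encodings, surrogate code points and values above U+10FFFF.
bool is_valid_utf8(std::string_view bytes) {
    // Smallest code point that legitimately needs a sequence of length 1..4.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;            // continuation byte or invalid lead byte
        if (i + len > n) return false; // sequence runs off the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;             // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}
```

With this sketch, the Shift-JIS bytes for a Japanese directory name such
as "\x93\xFA\x96\x7B" fail the check, while the same name encoded as UTF-8
passes. A tool that copies argv[1] into the JSON unchecked would emit the
former on a Shift-JIS system.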

Up to the P1689 authors what they want to do.

Just to muddy the waters still further, if path_view gets into the
standard, then the source encoding of the path_view MAY select the
filesystem native encoding used. So, if you supply a char source of
path_view on Windows, you might get the ANSI filesystem APIs on Windows
rather than the Unicode APIs.

I stress the MAY here because it depends on the renderer of the
path_view. They define the interpretation of source, not path_view.
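
As an illustration only, one possible renderer policy could be sketched
like this. The names native_api and select_api are hypothetical; this is
not the path_view proposal's interface, and it is not what LLFIO does.

```cpp
#include <type_traits>

// Hypothetical labels for the two families of Windows filesystem APIs.
enum class native_api { ansi, unicode };

// Hypothetical renderer policy: choose the API family from the character
// type of the path source. A char source selects the ANSI APIs; any other
// character type selects the Unicode APIs.
template <class Char>
constexpr native_api select_api(const Char*) {
    if constexpr (std::is_same_v<Char, char>) {
        return native_api::ansi;
    } else {
        return native_api::unicode;
    }
}
```

Under this policy select_api(argv[1]) would pick the ANSI APIs and
select_api(L"foo.txt") the Unicode ones. A different renderer is free to
transcode char input and always use the Unicode APIs instead.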

I have not currently decided what LLFIO will do on this. I really hate
the ANSI APIs. But Billy O'Neal gave me a very convincing motivating
use case:

int main(int argc, char *argv[])
{
  auto fh = file({}, argv[1]);

If LLFIO calls the ANSI API here, this "just works" even on Shift-JIS
and all the other weird legacy encodings Windows supports.

I still haven't brought myself to implement the support, though.

Niall

Received on 2019-09-07 01:33:10