Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 15 Aug 2019 07:54:51 -0700
On Wednesday, 14 August 2019 21:27:56 PDT Tom Honermann wrote:
> > I *want* UTF-8, because we have a lot of code that does things like
> >
> > QString("é")
> >
> > And our rule is that source code is encoded UTF-8, therefore I expect this
> > constructor to be passed a 2-byte string containing 0xc3, 0xa9.
>
> I don't understand why, just because the source file is UTF-8 encoded,
> that you would expect the string to be UTF-8 at run-time. I can
> understand *wanting* UTF-8, just not the implication that such desire is
> based on the source encoding.

Because it works on Unix, and in any Windows locale where the compiler
effectively makes a byte copy of the source into the literal. Think about it:
with CP1251, CP1252, etc., the compiler decodes the source into UTF-16 using
something like MultiByteToWideChar, processes it, then writes the strings into
the .obj file using a WideCharToMultiByte equivalent. That means the UTF-8
sequence above (bytes 0xc3 0xa9) does get written into the .obj file as 0xc3
0xa9, which is correct UTF-8.
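
A minimal sketch of that round trip, not what any compiler actually runs. It
assumes ISO-8859-1 (each byte maps to the code unit of the same value) as a
stand-in for a Windows single-byte code page; real CP1252 differs only in the
0x80-0x9f range, which neither byte here touches. The names decode_latin1,
encode_latin1 and compile_literal are hypothetical.

```cpp
#include <cassert>
#include <string>

// Decode step (stand-in for MultiByteToWideChar): every byte becomes the
// UTF-16 code unit with the same numeric value, per ISO-8859-1.
std::u16string decode_latin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Encode step (stand-in for WideCharToMultiByte): the inverse mapping,
// only defined for code units below 0x100.
std::string encode_latin1(const std::u16string &units)
{
    std::string out;
    for (char16_t u : units) {
        assert(u < 0x100 && "code unit not representable in Latin-1");
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    }
    return out;
}

// Hypothetical model of "source bytes -> UTF-16 -> literal bytes".
std::string compile_literal(const std::string &source_bytes)
{
    return encode_latin1(decode_latin1(source_bytes));
}
```

Feeding it the two UTF-8 bytes of "é" gives the intermediate UTF-16 string
U+00C3 U+00A9 (mojibake), yet the bytes that come back out are 0xc3 0xa9
again, which is why the literal still holds valid UTF-8.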

That has worked since time immemorial and continues to, today.

I admit this is a Western-centric view, since it's highly likely the sequence
isn't valid Shift-JIS (is that what Windows uses in Japan?). In order to have
cross-platform code, we'd have had to write QString("\xc3\xa9") and for our
own sources, we did. But our limitation shouldn't be imposed on those who
weren't under the same constraints.

And there was no alternative.

> > This is what
> > GCC, Clang and ICC (at least on Linux and macOS) will do. I need
> > interoperability of the source code with the cross-platform API.
>
> gcc has -finput-charset and -fexec-charset that match the MSVC options,
> but is UTF-8 by default. Clang only supports UTF-8. I don't know about
> ICC.
>
> Since C++11, I would have written the above as `QString(u8"é")` rather
> than requiring that the (presumed) execution encoding be set to UTF-8.

Because the codebases in question are much older than the ability to write
u8"" in UTF-8 sources. Saying "C++11" here is a red herring, since we need
compilers to support it and we need to be able to require those compilers. The
compiler support happened with the /source-charset option, which was added in
MSVC 2015 Update 2 (my commit log says we enabled it in Qt in Jan 2017). And we
didn't drop MSVC 2013 until Qt 5.11, released in March 2018.

So you see, we've had little more than a year with the ability to use u8"". But
the requirement that sources be UTF-8 is much older than that. We made that
change when we changed the QString constructor from the local 8 bit encoding
to UTF-8 and that happened in mid 2012, before we could even require C++11.

And be glad we didn't begin using u8"", since that would have broken with
C++20 and char8_t. If we had had a large codebase using u8"", SG16 would have
had to make a different choice regarding the hard break that the introduction
of char8_t is. At least that change comes after MS's adoption of SG10's
feature-test macros.
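
To illustrate the break: in C++11 through C++17 a u8"" literal is an array of
const char, while in C++20 it is an array of const char8_t, so code passing
u8"" to a const char * parameter stops compiling. A minimal sketch of how code
can adapt using the __cpp_char8_t feature-test macro follows; the names
utf8_unit, to_std_string and e_acute are hypothetical, not Qt API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// __cpp_char8_t is defined when the compiler provides the C++20 char8_t type.
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // C++20: u8"" yields const char8_t[N]
#else
using utf8_unit = char;     // C++11..17: u8"" yields const char[N]
#endif

// Hypothetical helper: copy n UTF-8 code units into a plain std::string.
// Reading the units through const char * is permitted in both modes.
std::string to_std_string(const utf8_unit *s, std::size_t n)
{
    assert(s != nullptr);
    return std::string(reinterpret_cast<const char *>(s), n);
}

// 'é' spelled as a universal-character-name, so the result does not depend
// on how the source file itself is encoded.
std::string e_acute()
{
    static const utf8_unit lit[] = u8"\u00e9";
    return to_std_string(lit, sizeof(lit) - 1);  // drop the terminating NUL
}
```

In both language modes e_acute() returns the two bytes 0xc3 0xa9; only the
element type of the literal changes.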

> > And if you did:
> > QFile f("é.txt");
> > f.open();
> >
> > It would call CreateFile((wchar_t[]){0xe9, '.', 't', 'x', 't'}, ...),
> > which is the expected behaviour.
>
> That looks to me like the expected behavior in either the case that
> QFile works on execution encoding (and /execution-charset is set or
> defaulted to Windows-1252) or if QFile requires UTF-8 (and
> /execution-charset:utf-8 is specified).

QFile takes a QString input, so it knows nothing about the execution encoding
on Windows (on Unix, it does convert from UTF-16 back to the local 8-bit
encoding, including proper NFD on macOS). My point is that the sequence above,
through the implicit QString, opens the file that was expected.
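
The conversion that makes this work can be sketched without Qt. The
hypothetical utf8_to_utf16 below is a simplified decoder that assumes
well-formed input and does no validation; it is not Qt's implementation, only
an illustration of how the bytes 0xc3 0xa9 become the single UTF-16 code unit
0xe9 that ends up in the wide-character file name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal UTF-8 -> UTF-16 decoder. Assumes well-formed input.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top payload bits.
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        assert(i + len <= in.size() && "truncated UTF-8 sequence");
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Calling utf8_to_utf16("\xc3\xa9.txt") yields the five code units 0xe9, '.',
't', 'x', 't', matching the array in the quoted CreateFile call.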

The difference between that and

        FILE *f = _wfopen(L"é.txt", L"r");

is that the Qt-based one works whether you had the compiler's source-charset
setting configured correctly to match the source's encoding or not, at least
in locales where the compiler effectively byte-copied the source. And since
you *couldn't* configure it to UTF-8 until January 2017, that means the source
above simply couldn't have been written until very recently.

And remember that we had working code in all encodings since 2012 with

        QFile f("\xc3\xa9.txt");

Previously, since 2003, you had to write

        QFile f(QString::fromUtf8("\xc3\xa9.txt"));
or
        QFile f(QString::fromLatin1("\xe9.txt"));
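
One caveat with the escaped spellings, sketched below without Qt: a hex escape
consumes every hex digit that follows it, so "\xc3\xa9.txt" is fine (the '.'
ends the escape) but a name starting with a letter from a to f needs the
literal split in two. The function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// 'é' as UTF-8 written with escapes: the bytes depend neither on the
// source encoding nor on the execution character set.
std::string escaped_name() { return "\xc3\xa9.txt"; }

// "\xc3\xa9abc" would be parsed as \xc3 followed by the single, out-of-range
// escape \xa9abc. Adjacent literals are concatenated after escapes are
// processed, so splitting the literal ends the escape where intended.
std::string escaped_then_hex_letters() { return "\xc3\xa9" "abc"; }

// Bounds-checked access to one byte, as an unsigned value.
unsigned char byte_at(const std::string &s, std::size_t i)
{
    assert(i < s.size());
    return static_cast<unsigned char>(s[i]);
}

// Two bytes for 'é', four for ".txt", one for the terminating NUL.
static_assert(sizeof("\xc3\xa9.txt") == 7, "unexpected literal size");
```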

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-08-15 16:54:56