SG16

Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Tom Honermann (tom_at_[hidden])
Date: 2019-08-14 23:27:56


On 8/15/19 12:03 AM, Thiago Macieira wrote:
> On Wednesday, 14 August 2019 19:51:37 PDT Tom Honermann wrote:
>> On 8/14/19 1:58 PM, Thiago Macieira wrote:
>>> This means that using MSVC with the /utf-8 option is the only sane
>>> alternative, but it's not the default.
>> I've been arguing that the /utf-8 option is almost never the right
>> option to use since this sets both the source and execution character
>> encodings and Microsoft does not yet support UTF-8 as the
>> (run-time/system/native) execution encoding.
> That goes back to your OP, but in reality that has no effect. It's perfectly
> fine for the execution encoding to be UTF-8 while the Win32 ACS is something
> else. Just don't use the 8-bit API -- which a proper Win32 application already
> doesn't use anyway.
>
>> I recommend use of
>> /source-charset:utf-8 instead (and perhaps /execution-charset:ascii to
>> ensure that encoded literals have the same meaning across all supported
>> (run-time/system/native) execution encodings). Use of either
>> /source-charset or /execution-charset will implicitly enable
>> /validate-charset which will cause the compiler to issue a warning if a
>> character cannot be encoded in the (presumed) execution encoding.
> Which is exactly why we want /utf-8, not just the source charset. Take this
> example (source is encoded in UTF-8):
>
> extern const char msg[] = "é\u20ac";
> extern const char16_t umsg[] = u"é\u20ac";
> // <https://msvc.godbolt.org/z/NXOGIm>
>
> As you can see from the Godbolt run, the first byte in msg with just
> -source-charset:utf-8 is 0xe9 and then it's followed by a 0x80 (I guess
> that's the Euro symbol in CP1252). That's not UTF-8.
Right, this is exactly what I expect.
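
For reference, here is a minimal sketch of why those bytes appear, assuming the execution encoding in that Godbolt run was Windows-1252 (as guessed above). The helpers `encode_utf8` and `encode_cp1252` are hypothetical names written only for this illustration, and the Windows-1252 one is deliberately partial: it handles only ASCII, the 0xA0-0xFF range, and the Euro sign.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8.
// Sketch only: no rejection of surrogates or out-of-range values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical, partial Windows-1252 encoder (assumption: only the
// characters needed for this example are mapped; anything else is '?').
std::string encode_cp1252(char32_t cp) {
    if (cp == 0x20AC) return std::string(1, static_cast<char>(0x80)); // Euro sign
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return std::string(1, static_cast<char>(cp));                 // same as Latin-1
    return "?";
}

// The two encodings of the literal "é\u20ac" discussed above.
void check_msg_bytes() {
    assert(encode_utf8(0xE9) + encode_utf8(0x20AC) == "\xC3\xA9\xE2\x82\xAC");
    assert(encode_cp1252(0xE9) + encode_cp1252(0x20AC) == "\xE9\x80");
}
```

Calling `check_msg_bytes()` passes: the 0xe9, 0x80 pair is the Windows-1252 encoding of the two characters, while UTF-8 would need five bytes.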
>
> I *want* UTF-8, because we have a lot of code that does things like
>
> QString("é")
>
> And our rule is that source code is encoded UTF-8, therefore I expect this
> constructor to be passed a 2-byte string containing 0xc3, 0xa9.
I don't understand why you would expect the string to be UTF-8 at
run-time just because the source file is UTF-8 encoded. I can
understand *wanting* UTF-8, just not the implication that such a desire
follows from the source encoding.
> This is what
> GCC, Clang and ICC (at least on Linux and macOS) will do. I need
> interoperability of the source code with the cross-platform API.

GCC has -finput-charset and -fexec-charset options that correspond to
the MSVC options, but GCC defaults to UTF-8.  Clang only supports UTF-8.
I don't know about ICC.

Since C++11, I would have written the above as `QString(u8"é")` rather
than requiring that the (presumed) execution encoding be set to UTF-8.
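
A minimal sketch of that point, assuming nothing about Qt: a `u8` literal is encoded as UTF-8 regardless of the execution encoding chosen on the command line. The character is spelled as `\u00e9` here so the example does not depend on the source encoding either; `kEAcute` and `byte_at` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>

// The element type of a u8 literal is char in C++11 through C++17 and
// char8_t in C++20; binding with `auto&` works for both.
static const auto& kEAcute = u8"\u00e9";

// Two UTF-8 code units (0xC3, 0xA9) plus the terminating null.
static_assert(sizeof(kEAcute) == 3, "u8\"\\u00e9\" occupies three code units");

// Return the i-th code unit of the literal as an unsigned byte value.
unsigned char byte_at(std::size_t i) {
    return static_cast<unsigned char>(kEAcute[i]);
}

void check_u8_literal() {
    assert(byte_at(0) == 0xC3);
    assert(byte_at(1) == 0xA9);
    assert(byte_at(2) == 0x00);
}
```

Whether a given API accepts such a literal directly depends on that API (and, from C++20 on, on how it handles `char8_t`), so treat this only as a demonstration of the literal's encoding.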

>
> And if you did:
> QFile f("é.txt");
> f.open();
>
> It would call CreateFile((wchar_t[]){0xe9, '.', 't', 'x', 't'}, ...), which is
> the expected behaviour.
>
That looks to me like the expected behavior in either case: QFile works
on the execution encoding (and /execution-charset is set or defaulted to
Windows-1252), or QFile requires UTF-8 (and /execution-charset:utf-8 is
specified).
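
To illustrate, here is a rough sketch of the two cases. It is not QFile's implementation; `widen_cp1252` and `widen_utf8` are hypothetical helpers, both partial (the Windows-1252 one maps only ASCII, 0xA0-0xFF, and the Euro sign; the UTF-8 one assumes well-formed input of at most three bytes per character). Either way, the file name ends up as the same UTF-16 sequence starting with 0x00E9.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Case 1 (assumption): the narrow string is Windows-1252 encoded.
std::u16string widen_cp1252(const std::string& in) {
    std::u16string out;
    for (unsigned char b : in) {
        if (b == 0x80) out += static_cast<char16_t>(0x20AC); // Euro sign
        else out += static_cast<char16_t>(b); // sketch: 0x81-0x9F not mapped properly
    }
    return out;
}

// Case 2 (assumption): the narrow string is UTF-8 encoded.
std::u16string widen_utf8(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out += static_cast<char16_t>(b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < in.size()) {
            out += static_cast<char16_t>(((b0 & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < in.size()) {
            out += static_cast<char16_t>(((b0 & 0x0F) << 12) |
                                         ((in[i + 1] & 0x3F) << 6) |
                                         (in[i + 2] & 0x3F));
            i += 3;
        } else {
            out += static_cast<char16_t>(0xFFFD); // unsupported in this sketch
            i += 1;
        }
    }
    return out;
}

void check_same_wide_name() {
    // "é.txt" as Windows-1252 bytes and as UTF-8 bytes.
    assert(widen_cp1252("\xE9.txt") == u"\u00e9.txt");
    assert(widen_utf8("\xC3\xA9.txt") == u"\u00e9.txt");
}
```

The mismatch only shows up when the literal's encoding and the encoding the receiving function assumes disagree.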

Tom.


SG16 list run by sg16-owner@lists.isocpp.org