Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Thiago Macieira (thiago_at_[hidden])
Date: 2019-08-14 23:03:18


On Wednesday, 14 August 2019 19:51:37 PDT Tom Honermann wrote:
> On 8/14/19 1:58 PM, Thiago Macieira wrote:
> > This means that using MSVC with the /utf-8 option is the only sane
> > alternative, but it's not the default.
>
> I've been arguing that the /utf-8 option is almost never the right
> option to use since this sets both the source and execution character
> encodings and Microsoft does not yet support UTF-8 as the
> (run-time/system/native) execution encoding.

That goes back to your OP, but in reality that has no effect. It's perfectly
fine for the execution encoding to be UTF-8 while the Win32 ACS is something
else. Just don't use the 8-bit API -- which a proper Win32 application already
doesn't use anyway.

> I recommend use of
> /source-charset:utf-8 instead (and perhaps /execution-charset:ascii to
> ensure that encoded literals have the same meaning across all supported
> (run-time/system/native) execution encodings). Use of either
> /source-charset or /execution-charset will implicitly enable
> /validate-charset which will cause the compiler to issue a warning if a
> character cannot be encoded in the (presumed) execution encoding.

Which is exactly why we want /utf-8, not just the source charset. Take this
example (source is encoded in UTF-8):

extern const char msg[] = "é\u20ac";
extern const char16_t umsg[] = u"é\u20ac";
// <https://msvc.godbolt.org/z/NXOGIm>

As you can see from the Godbolt run, the first byte in msg with just
-source-charset:utf-8 is 0xe9, followed by 0x80 (I guess that's the Euro
symbol in CP1252). That's not UTF-8.
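
To make that difference visible at build time, here is a minimal sketch of a compile-time guard. It assumes C++14 (for the loop in a constexpr function). The names expected_utf8 and narrow_literals_are_utf8 are made up for illustration. The literal uses universal character names so the check depends only on the execution character set, not on how the source file is encoded.

```cpp
#include <cstddef>

// Universal character names keep the literal independent of the source
// encoding, so the stored bytes depend only on the execution character set.
constexpr char msg[] = "\u00e9\u20ac";   // U+00E9 (é) followed by U+20AC (Euro sign)

// UTF-8 encoding of those two code points: c3 a9, then e2 82 ac.
constexpr unsigned char expected_utf8[] = {0xc3, 0xa9, 0xe2, 0x82, 0xac};

// Hypothetical helper: true when msg (without its terminator) holds
// exactly the UTF-8 bytes listed above.
constexpr bool narrow_literals_are_utf8()
{
    if (sizeof msg - 1 != sizeof expected_utf8)
        return false;
    for (std::size_t i = 0; i < sizeof expected_utf8; ++i) {
        if (static_cast<unsigned char>(msg[i]) != expected_utf8[i])
            return false;
    }
    return true;
}

// Fails the build when narrow literals are not encoded as UTF-8, for
// example when the execution character set is CP1252 (bytes e9 80).
static_assert(narrow_literals_are_utf8(),
              "narrow string literals are not UTF-8 encoded");
```

With a UTF-8 execution character set (the GCC and Clang default, or MSVC with /utf-8) the assertion should pass. With a CP1252 execution character set the literal is only two bytes long and the build stops.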

I *want* UTF-8, because we have a lot of code that does things like

        QString("é")

And our rule is that source code is encoded UTF-8, therefore I expect this
constructor to be passed a 2-byte string containing 0xc3, 0xa9. This is what
GCC, Clang and ICC (at least on Linux and macOS) will do. I need
interoperability of the source code with the cross-platform API.

And if you did:
        QFile f("é.txt");
        f.open();

It would call CreateFile((wchar_t[]){0xe9, '.', 't', 'x', 't'}, ...), which is
the expected behaviour.
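
For illustration, the UTF-8 to UTF-16 step that gets you from the narrow literal to that wide string can be sketched as below. This is not Qt's implementation. utf8_to_utf16 is a hypothetical helper that assumes mostly well-formed input and does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not Qt's code: decodes UTF-8 into UTF-16 code units.
// Invalid lead bytes and truncated sequences become U+FFFD; continuation
// bytes are not validated.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;          // number of continuation bytes
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xe0) == 0xc0) {
            cp = lead & 0x1f; extra = 1;
        } else if ((lead & 0xf0) == 0xe0) {
            cp = lead & 0x0f; extra = 2;
        } else if ((lead & 0xf8) == 0xf0) {
            cp = lead & 0x07; extra = 3;
        } else {
            out += u'\ufffd';           // stray continuation or invalid lead byte
            ++i;
            continue;
        }
        if (i + extra >= in.size() && extra > 0) {
            out += u'\ufffd';           // sequence cut off at end of input
            break;
        }
        ++i;
        for (std::size_t k = 0; k < extra; ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3f);
        if (cp >= 0x10000) {            // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xd800 + (cp >> 10));
            out += static_cast<char16_t>(0xdc00 + (cp & 0x3ff));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

Given the bytes 0xc3, 0xa9 followed by ".txt", this yields the five code units 0xe9, '.', 't', 'x', 't', which is what the wide Win32 API wants to see. If the literal had been compiled to the CP1252 byte 0xe9 instead, the decoder would treat it as a three-byte lead byte and produce garbage.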

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

SG16 list run by sg16-owner@lists.isocpp.org