Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Thiago Macieira (thiago_at_[hidden])
Date: 2019-08-14 12:58:42
On Tuesday, 13 August 2019 15:29:29 PDT keld_at_[hidden] wrote:
> > I guess you never used windows?
> I have not done much programming on window systems. I sometimes lived on
> administering them. What are the problems wrt to this?
1) there are actually three different, active character sets: the DOS
codepage, the 8-bit "ANSI" codepage, and the 16-bit wide character codepage.
For everyone except Niall, the 16-bit codepage is always UTF-16 (he'll tell
you it's actually binary 16-bit, with no surrogate interpretation).
The DOS and 8-bit "ANSI" codepages are different 8-bit encodings. But I think
we can leave the DOS codepage in the past, since it's much less relevant these
days.
That leaves the problem that the 8-bit encoding is *not* UTF-8, for the vast
majority of people. I read somewhere that Vietnamese Windows uses UTF-8, but
for almost everyone else it's usually a Windows-specific encoding. The one
used by English Windows is CP1252, which mostly matches ISO-8859-1, but
encodes different things in the 0x80-0x9F range.
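To make the CP1252 versus ISO-8859-1 difference concrete, here is a minimal sketch of the two single-byte decoders. The names decode_latin1, decode_cp1252 and kCp1252High are made up for illustration, and returning 0 for the five bytes CP1252 leaves undefined is my assumption, not what any particular Windows API does.

```cpp
#include <array>
#include <cstdint>

// Code points that CP1252 assigns to bytes 0x80..0x9F. A value of 0 marks the
// five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that CP1252 leaves undefined
// (an assumption of this sketch; real converters may behave differently).
constexpr std::array<char32_t, 32> kCp1252High = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
};

// ISO-8859-1 maps every byte to the code point with the same value,
// so 0x80..0x9F become the C1 control characters.
constexpr char32_t decode_latin1(std::uint8_t byte) { return byte; }

// CP1252 agrees with ISO-8859-1 everywhere except the 0x80..0x9F range.
constexpr char32_t decode_cp1252(std::uint8_t byte) {
    if (byte >= 0x80 && byte <= 0x9F)
        return kCp1252High[byte - 0x80];
    return byte;
}
```

With this, byte 0x80 decodes to U+20AC (the euro sign) as CP1252 but to the control character U+0080 as ISO-8859-1, while a byte such as 0xE4 gives the same code point in both.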
The big problem with this is that the entire C API, like fopen() and printf(),
and the POSIX-imported API like _open(), is using the 8-bit "ANSI" encoding.
Since C++ builds on those, we are similarly affected. This also means that
fopen() cannot open all files in the system, main()'s argv does not receive the
full command-line, etc.
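As a rough sketch of how code can sidestep the 8-bit "ANSI" encoding when opening files: the helper below (open_for_reading is a hypothetical name) takes a std::filesystem::path and, on Windows, hands the wide-character name to _wfopen from the Microsoft C runtime instead of going through fopen(). It assumes a C++17 compiler.

```cpp
#include <cstdio>
#include <filesystem>

// Hypothetical helper: open a file for reading without routing the file name
// through the 8-bit "ANSI" code page on Windows.
inline std::FILE* open_for_reading(const std::filesystem::path& p) {
#ifdef _WIN32
    // On Windows, path::c_str() yields const wchar_t*; _wfopen is the
    // wide-character counterpart of fopen in the Microsoft C runtime.
    return _wfopen(p.c_str(), L"rb");
#else
    // Elsewhere, path::c_str() yields const char* and the bytes are passed
    // to fopen unchanged.
    return std::fopen(p.c_str(), "rb");
#endif
}
```

The caller still has to get the name into a path without losing information, which on Windows means starting from a wide-character source rather than from main()'s argv.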
2) MSVC has the "traditional" interpretation of the source and execution
charsets. Unlike GCC and Clang, it will not do the pass-through of source
bytes into narrow character string literals. And since wide-character literals
are fairly common due to the 16-bit W API, the chances of mojibake are high.
Worse, because the entire source code is read using the system's 8-bit ANSI
encoding, you can produce uncompileable sources with *comments*. For example,
if Corentin's friend Bjørn had in his source:
// Copyright (C) 2019 Bjørn Bjørnsen
Then his friend Yamada Tarō with a Japanese Windows might not be able to
compile the file because the ø sequence (whether UTF-8 or Latin1 or Latin9) is
not valid. I'm not making this up. We had this problem in Qt because of a
copyright line (the ä in "Klarälvdalens Datakonsult AB", and ä is not "ae" in
Swedish). Note how I did not use ©.
This means that using MSVC with the /utf-8 option is the only sane
alternative, but it's not the default.
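For code that has to build even where /utf-8 is not passed, one workaround is to keep the source pure ASCII and spell non-ASCII characters with escapes. This is only a sketch; author_utf8 and author_wide are hypothetical names, and the narrow string is UTF-8 simply because I wrote the UTF-8 bytes by hand.

```cpp
#include <cstring>
#include <cwchar>

// U+00F8 (o with stroke) written as its two UTF-8 bytes, 0xC3 0xB8. The file
// stays pure ASCII, so the compiler's guess about the source encoding does
// not affect the bytes stored in the literal.
inline const char* author_utf8() { return "Bj\xC3\xB8rn"; }

// The same letter as a universal-character-name in a wide literal; the
// compiler converts it to the wide execution encoding itself.
inline const wchar_t* author_wide() { return L"Bj\u00F8rn"; }
```

The narrow literal holds six bytes before the terminator and the wide one holds five wchar_t units, whichever codepage the machine compiling the file happens to use.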
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products