
Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 14 Aug 2019 10:58:42 -0700
On Tuesday, 13 August 2019 15:29:29 PDT keld_at_[hidden] wrote:
> > I guess you never used windows?
> I have not done much programming on window systems. I sometimes lived on
> administering them. What are the problems wrt to this?

Two problems:

1) there are actually three different, active character sets: the DOS
codepage, the 8-bit "ANSI" codepage, and the 16-bit wide character codepage.
For everyone except Niall, the 16-bit codepage is always UTF-16 (he'll tell
you it's actually binary 16-bit, with no surrogate interpretation).

The DOS and 8-bit "ANSI" codepages are different 8-bit encodings. But I think
we can leave the DOS codepage in the past, since it's much less relevant these
days.
That leaves the problem that the 8-bit encoding is *not* UTF-8, for the vast
majority of people. I read somewhere that Vietnamese Windows uses UTF-8, but
for almost everyone else it's usually a Windows-specific encoding. The one
used by English Windows is CP1252, which mostly matches ISO-8859-1, but
encodes different things in the 0x80-0x9F range.

The big problem with this is that the entire C API, like fopen() and printf(),
and the POSIX-imported API like _open(), is using the 8-bit "ANSI" encoding.
Since C++ builds on those, we are similarly affected. This also means that
fopen() cannot open all files in the system, main()'s argv does not receive
the full command line, etc.

2) MSVC has the "traditional" interpretation of the source and execution
charsets. Unlike GCC and Clang, it will not do the pass-through of source
bytes into narrow character string literals. And since wide-character literals
are fairly common due to the 16-bit W API, the chances of mojibake are
actually considerable.

Worse, because the entire source code is read using the system's 8-bit ANSI
encoding, you can produce uncompilable sources with *comments*. For example,
if Corentin's friend Bjørn had in his source:

 // Copyright (C) 2019 Bjørn Bjørnsen

Then his friend Yamada Tarō with a Japanese Windows might not be able to
compile the file because the ø sequence (whether UTF-8 or Latin1 or Latin9) is
not valid. I'm not making this up. We had this problem in Qt because of a
copyright line (the ä in "Klarälvdalens Datakonsult AB", and ä is not "ae" in
Swedish). Note how I did not use ©.

This means that using MSVC with the /utf-8 option is the only sane
alternative, but it's not the default.

Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-08-14 19:58:52