Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Thu, 15 Aug 2019 09:11:00 +0300
On Wed, Aug 14, 2019, 18:59 Davis Herring via Core <core_at_[hidden]>
wrote:

> > u8"é" is ambiguous. Both people and the compiler may interpret that in a
> > variety of ways. Notably if I have utf-8 in that file, which I wrote on
> > Linux, but then the msvc compiler thinks it's windows 1252...
> > Mojibake.
>
> We have a recursive example of bytes/characters confusion here. If you
> want to say that the bytes 75 38 22 c3 a9 22 (because you "have utf-8 in
> that file") are ambiguous, of course they are, but so is 5c 41 unless
> you restrict to ASCII/Latin-*/UTF-8. You always have to arrange for
> your compiler to know which characters are signified by the bytes in
> your source file, and having some of them be non-ASCII doesn't
> fundamentally change anything (even though in practice it makes it harder).
>
> Your message doesn't contain those bytes anyway; since it contains a header
>
> Content-Type: text/plain; charset="UTF-8"
>
> it's appropriate to say that you wrote 5 (abstract) characters: LATIN
> SMALL LETTER U, DIGIT EIGHT, QUOTATION MARK, LATIN SMALL LETTER E WITH
> ACUTE, and QUOTATION MARK again. (Of course, you could also have
> written LATIN SMALL LETTER E and COMBINING ACUTE ACCENT; that's a
> different sort of ambiguity.)
>

It is probably best to avoid the term "character" and its derivatives when
discussing Unicode, since the term is itself ambiguous. Those are all code
points. "LATIN SMALL LETTER E WITH ACUTE" is the same grapheme (aka
"user-perceived character") as "LATIN SMALL LETTER E" followed by
"COMBINING ACUTE ACCENT", just represented in a different way. The two
should generally be treated identically regardless of which normalization
form they are encoded in.
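
To make that concrete, here is a minimal sketch of the two representations
as UTF-8 bytes. The bytes are written with hex escapes so the example does
not depend on the source file encoding. The names count_code_points,
nfc_e_acute and nfd_e_acute are made up for illustration, and the counter
assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8 by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::string nfc_e_acute = "\xC3\xA9";

// Decomposed: U+0065 LATIN SMALL LETTER E followed by
// U+0301 COMBINING ACUTE ACCENT.
const std::string nfd_e_acute = "e\xCC\x81";
```

Both strings render as the same grapheme, yet they differ in byte length
and in code point count, and a plain == on them reports them as unequal.
Treating them identically requires normalizing first, which the standard
library does not do for you.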

This also avoids an ambiguity where C++ terminology expects a "character"
to be a fixed-size object, while graphemes are variably sized in Unicode.
Code points are fixed size, but they aren't useful to work with unless you
are implementing one of the defined Unicode algorithms, so they shouldn't
be emphasized in interfaces for ordinary developers.
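
A small sketch of the size mismatch, using char32_t so that each element is
exactly one code point. The variable names are hypothetical, and the
universal-character-names keep the literals independent of the source
encoding.

```cpp
#include <string>

// One grapheme, one code point: U+00E9.
const std::u32string e_precomposed = U"\u00E9";

// The same grapheme as two code points: U+0065, U+0301.
const std::u32string e_decomposed = U"e\u0301";

// One grapheme (a flag) built from two regional indicator code points:
// U+1F1EB (F) and U+1F1F7 (R).
const std::u32string flag_fr = U"\U0001F1EB\U0001F1F7";
```

Each element has the same size, but size() returns 1, 2 and 2 for what a
user would perceive as a single character in every case. Finding grapheme
boundaries takes the Unicode text segmentation rules, which is the kind of
algorithm where working at the code point level makes sense.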

Received on 2019-08-15 08:11:14