On Wed, Aug 14, 2019, 18:59 Davis Herring via Core <core@lists.isocpp.org> wrote:

> u8"é" is ambiguous. Both people and the compiler may interpret that in a
> variety of ways. Notably if I have utf-8 in that file, which I wrote on
> Linux, but then the msvc compiler thinks it's windows 1252...
> Mojibake.

We have a recursive example of bytes/characters confusion here. If you
want to say that the bytes 75 38 22 c3 a9 22 (because you "have utf-8 in
that file") are ambiguous, of course they are, but so is 5c 41 unless
you restrict to ASCII/Latin-*/UTF-8. You always have to arrange for
your compiler to know which characters are signified by the bytes in
your source file, and having some of them be non-ASCII doesn't
fundamentally change anything (even though in practice it makes it harder).

Your message doesn't contain those bytes anyway; since it contains a header

Content-Type: text/plain; charset="UTF-8"

it's appropriate to say that you wrote 5 (abstract) characters: LATIN
SMALL LETTER U, DIGIT EIGHT, QUOTATION MARK, LATIN SMALL LETTER E WITH
ACUTE, and QUOTATION MARK again. (Of course, you could also have
written LATIN SMALL LETTER E and COMBINING ACUTE ACCENT; that's a
different sort of ambiguity.)
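
The bytes-versus-characters distinction above can be sketched in C++. This is only an illustration, not a production decoder: `decode_utf8` and `decode_latin1` are hypothetical helper names, the UTF-8 routine assumes well-formed input and does no validation, and Latin-1 stands in for Windows-1252 because the two agree on every byte in this example (they differ only in the range 0x80-0x9F).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode UTF-8 into code points. Sketch only: assumes well-formed input and
// does not reject overlong forms, surrogates, or stray continuation bytes.
std::vector<char32_t> decode_utf8(const std::vector<std::uint8_t>& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t lead = bytes[i++];
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (bytes[i] & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
    }
    return out;
}

// Decode as ISO-8859-1: each byte is the code point with the same value.
std::vector<char32_t> decode_latin1(const std::vector<std::uint8_t>& bytes) {
    return std::vector<char32_t>(bytes.begin(), bytes.end());
}
```

Fed the bytes 75 38 22 c3 a9 22, `decode_utf8` yields five code points, the fourth being U+00E9, while `decode_latin1` yields six, with U+00C3 and U+00A9 where the single letter was intended. The bytes are the same in both cases; only the assumed encoding differs.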

It is probably best to avoid the term "character" and its derivatives when discussing Unicode, since the term is itself ambiguous. Those are all code points. "LATIN SMALL LETTER E WITH ACUTE" is the same grapheme (aka "user-perceived character") as "LATIN SMALL LETTER E" followed by "COMBINING ACUTE ACCENT", just represented in a different way. The two should generally be treated identically, regardless of which normalization form the text is in.

This also avoids an ambiguity where C++ terminology expects a "character" to be a fixed-size object, while graphemes are variable-sized in Unicode. Code points are fixed-size, but they aren't useful to work with unless you are implementing one of the defined Unicode algorithms, so they shouldn't be emphasized in interfaces for ordinary developers.
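
To make the size mismatch concrete, here is a rough sketch comparing the two spellings of "é" discussed above. The names `utf8_units_for`, `utf8_length`, `precomposed`, and `decomposed` are made up for this example, and the code only counts code points and UTF-8 code units; it does not perform normalization or grapheme segmentation, which would need a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 code units (bytes) needed to encode one code point.
// Assumes cp is a valid Unicode scalar value.
constexpr std::size_t utf8_units_for(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Total UTF-8 code units for a sequence of code points.
std::size_t utf8_length(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) n += utf8_units_for(cp);
    return n;
}

// Two spellings of the same user-perceived character.
// LATIN SMALL LETTER E WITH ACUTE:
const std::u32string precomposed = U"\u00E9";
// LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
const std::u32string decomposed = U"e\u0301";
```

Here `precomposed.size()` is 1 and `decomposed.size()` is 2, and their UTF-8 lengths are 2 and 3 code units respectively, yet both are one grapheme. A plain `==` on the two strings reports them as different, which is why comparison at the code point level is not enough for ordinary text handling.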