SG16

Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Corentin (corentin.jabot_at_[hidden])
Date: 2019-08-14 11:54:16


On Wed, Aug 14, 2019, 5:59 PM Davis Herring <herring_at_[hidden]> wrote:

> > u8"é" is ambiguous. Both people and the compiler may interpret that in a
> > variety of ways. Notably, if I have UTF-8 in that file, which I wrote on
> > Linux, but then the MSVC compiler thinks it's Windows-1252...
> > Mojibake.
>
> We have a recursive example of bytes/characters confusion here. If you
> want to say that the bytes 75 38 22 c3 a9 22 (because you "have utf-8 in
> that file") are ambiguous, of course they are, but so is 5c 41 unless
> you restrict to ASCII/Latin-*/UTF-8. You always have to arrange for
> your compiler to know which characters are signified by the bytes in
> your source file, and having some of them be non-ASCII doesn't
> fundamentally change anything (even though in practice it makes it harder).
>
> Your message doesn't contain those bytes anyway; since it contains a header
>
> Content-Type: text/plain; charset="UTF-8"
>
> it's appropriate to say that you wrote 5 (abstract) characters: LATIN
> SMALL LETTER U, DIGIT EIGHT, QUOTATION MARK, LATIN SMALL LETTER E WITH
> ACUTE, and QUOTATION MARK again. (Of course, you could also have
> written LATIN SMALL LETTER E and COMBINING ACUTE ACCENT; that's a
> different sort of ambiguity.)
>

Yet there was no ambiguity because, as you mentioned, the encoding
information was not lost.
But yes, I have a tendency to assume UTF-8 :/
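
To make the mojibake case concrete, here is a minimal C++ sketch that decodes
the six bytes quoted above (75 38 22 c3 a9 22) under the two assumptions being
discussed. The helper names (sample_bytes, decode_utf8, decode_windows_1252)
are hypothetical and not from any library, and both decoders are deliberately
incomplete: the UTF-8 one handles only 1- and 2-byte sequences, and the
Windows-1252 one skips the 0x80..0x9F block, which would need a lookup table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using CodePoints = std::vector<char32_t>;

// The six bytes from the quoted message: u 8 " <c3 a9> "
inline Bytes sample_bytes() {
    return {0x75, 0x38, 0x22, 0xC3, 0xA9, 0x22};
}

// Hypothetical helper: interpret the bytes as Windows-1252.
// Outside 0x80..0x9F the mapping matches ISO-8859-1, so the byte value
// is the code point. The 0x80..0x9F block is not handled in this sketch.
inline CodePoints decode_windows_1252(const Bytes& in) {
    CodePoints out;
    for (std::uint8_t b : in) {
        assert((b < 0x80 || b > 0x9F) && "0x80..0x9F not handled in this sketch");
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Hypothetical helper: interpret the bytes as UTF-8.
// Only 1- and 2-byte sequences are decoded; anything else (including a
// truncated sequence) becomes U+FFFD. Overlong forms are not rejected.
inline CodePoints decode_utf8(const Bytes& in) {
    CodePoints out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint8_t b = in[i];
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));
        } else if ((b & 0xE0) == 0xC0 && i + 1 < in.size() &&
                   (in[i + 1] & 0xC0) == 0x80) {
            // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
            out.push_back(static_cast<char32_t>(((b & 0x1F) << 6) |
                                                (in[i + 1] & 0x3F)));
            ++i;  // consume the continuation byte
        } else {
            out.push_back(static_cast<char32_t>(0xFFFD));
        }
    }
    return out;
}
```

With these helpers, decode_utf8(sample_bytes()) yields 5 code points with
U+00E9 in the fourth position, whereas decode_windows_1252(sample_bytes())
yields 6, with U+00C3 U+00A9 ("Ã©") where the é should be.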

>
> Davis
>
> --
> This product is sold by volume, not by mass. If it appears too dense or
> too sparse, it is because mass-energy conversion has occurred during
> shipping.
>



SG16 list run by sg16-owner@lists.isocpp.org