Re: [SG16-Unicode] As-if Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 15 Aug 2019 08:20:30 -0400
That's what the standard refers to now as the internal encoding.


An implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a *universal-character-name
<http://eel.is/c++draft/lex#nt:universal-character-name>* (e.g., using the
\uXXXX notation), are handled equivalently except where this replacement
is reverted ([lex.pptoken] <http://eel.is/c++draft/lex#pptoken>) in a raw
string literal. <http://eel.is/c++draft/lex#phases-1.1.sentence-4>
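
To make the quoted rule concrete, here is a minimal sketch rather than anything
normative. It assumes a C++17 or later compiler that decodes this file as UTF-8
(the default for GCC and Clang; MSVC needs /utf-8). The variable names are
mine, chosen for illustration only.

```cpp
#include <string_view>

using namespace std::string_view_literals;

// The same character, U+00E9, spelled two ways in the source file.
constexpr auto spelled_directly = u8"é"sv;       // extended character
constexpr auto spelled_as_ucn   = u8"\u00e9"sv;  // universal-character-name

// "Handled equivalently": both literals hold the same UTF-8 code units.
static_assert(spelled_directly == spelled_as_ucn);
static_assert(spelled_as_ucn.size() == 2);  // 0xC3 0xA9

// In a raw string literal the text stays as written, so the six
// characters \ u 0 0 e 9 are kept and no UCN is formed.
constexpr auto raw_ucn = u8R"(\u00e9)"sv;
static_assert(raw_ucn.size() == 6);
static_assert(raw_ucn != spelled_as_ucn);
```

The checks run at compile time, so the file compiling at all is the result.
Note that none of this tells you what the internal encoding actually is; it
only shows the observable equivalence the wording requires.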

Now, this doesn't quite require that the internal encoding be Unicode. If
I'm reading it correctly, the internal encoding could even be lossy. However,
given the other requirements around u literals, a lossy internal encoding is
somewhat unlikely in practice. It might be worth exploring making it an
explicit requirement that the internal encoding be some unspecified Unicode
transform, so even if it's UTF-EBCDIC, that's OK.

All of this language in the standard seems to have been drafted between 1994
and 1998, and doesn't correspond well to current nomenclature around
character encodings. It also comes from a time when it wasn't clear that
programs would routinely have to deal with multiple encodings at the same
time during their lifetime, and that one of the most common would be a
multibyte encoding.

On Thu, Aug 15, 2019, 07:55 Lyberta <lyberta_at_[hidden]> wrote:

> There is so much discussion and misunderstanding about C++ charsets in
> the adjacent thread and on the Internet. Maybe we can simplify this a bit.
> I propose we add an "Intermediate Character Set" and define it as an
> implementation-defined Unicode encoding form.
> Then we add rules like these:
> When compiling a TU, text in the source charset gets converted to the
> intermediate charset before preprocessing. This eliminates any ambiguity
> about string literals and comments.
> Pretty much all text operations during compilation work in terms of the
> intermediate charset.
> As the last step before writing an object file, text data gets converted
> to the various "execution" encodings.
> This will allow us to write standardese in the framework of Unicode but
> still allow exotic charsets as input and output.
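
The three-stage model proposed in the quoted message could be sketched roughly
as below. Everything concrete here is an assumption made for illustration:
ISO-8859-1 as the source charset, UTF-32 as the intermediate form, UTF-8 as
the one execution encoding, and the hypothetical function names
to_intermediate and to_execution_utf8. A real implementation would choose
its own.

```cpp
#include <string>

// Stage 1 (assumed source charset: ISO-8859-1). Each Latin-1 byte maps
// directly to the code point with the same value, U+0000..U+00FF.
std::u32string to_intermediate(const std::string& latin1_source) {
    std::u32string out;
    out.reserve(latin1_source.size());
    for (unsigned char byte : latin1_source) {
        out.push_back(static_cast<char32_t>(byte));
    }
    return out;
}

// Stage 2 is where the preprocessor and the rest of translation would
// work purely on the intermediate text; it is omitted in this sketch.

// Stage 3 (assumed execution encoding: UTF-8). No validation is done:
// surrogates and values above U+10FFFF are not rejected here.
std::string to_execution_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For example, to_execution_utf8(to_intermediate("caf\xE9")) yields the five
bytes "caf\xC3\xA9". The point of the shape is that only the first and last
stages know about non-Unicode charsets; everything in between sees code
points.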

Received on 2019-08-15 14:20:45