SG16

Subject: Updates for D2314R1: Character sets and encodings
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-02-25 02:27:26


https://wiki.edg.com/pub/Wg21virtual2021-02/SG16/d2314r1.html

In response to yesterday's discussion:

 - header-names can now contain any character; they are mapped to files
in an implementation-defined manner, so it's up to the implementation what
it does with a character sequence that looks like a UCN.

 - The definition of execution (wide) character set was moved to [character.seq],
along with the definition of what "locale-specific" means.

 - Dropped the (recently added) requirement about encoding consistency
between the literal encoding and the execution (runtime) encoding
(reflects existing practice).

New text:

"The execution character set and the execution wide-character set are supersets
of the basic literal character set (5.3 [lex.charset]). The encodings of the
execution character sets and the sets of additional elements (if any) are
locale-specific. [ Note: The encoding of the execution character sets can be
unrelated to any literal encoding. -- end note ]"
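
To illustrate the note above, here is a minimal sketch (not part of the paper) contrasting a property fixed at translation time with one that depends on the runtime locale. The helper name max_bytes_per_char is hypothetical, and the locale names an implementation accepts are implementation-defined; only "C" is assumed to exist.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Fixed at translation time by the ordinary literal encoding.
constexpr std::size_t kLiteralSize = sizeof("abc") - 1;

// Hypothetical helper: reports MB_CUR_MAX for the named locale, i.e. the
// maximum number of bytes per character in that locale's execution encoding.
// Returns 0 if the implementation does not provide the locale.
inline std::size_t max_bytes_per_char(const char* locale_name) {
    // Remember the current LC_CTYPE locale so it can be restored.
    const std::string saved = std::setlocale(LC_CTYPE, nullptr);
    std::size_t result = 0;
    if (std::setlocale(LC_CTYPE, locale_name) != nullptr) {
        result = MB_CUR_MAX;  // depends on the locale selected at run time
    }
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

Calling max_bytes_per_char with different locale names can give different answers within a single program run, while kLiteralSize cannot change after translation.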

 - The change to code unit semantics that Hubert pointed out has been fixed;
the text now reads:

"A literal encoding encodes each element of the basic literal character
set as a single code unit with non-negative value, distinct from the
code unit for any other such element. [ Note: A character not in the
basic literal character set can be encoded with more than one code unit;
the value of such a code unit can be the same as that of a code unit
for an element of the basic literal character set. -- end note ]."
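
The requirement quoted above can be checked at compile time for a given implementation. The following is a sketch under assumptions: kSample holds the space plus the 91 graphic characters and is not the normative list (which also contains control characters), and the helper names are made up for illustration.

```cpp
#include <cstddef>

// Sample of the basic literal character set (an assumption for illustration):
// the space character plus the 91 graphic characters.
constexpr char kSample[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    " _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// Number of code units in the sample, excluding the terminating null.
constexpr std::size_t kSampleSize = sizeof(kSample) - 1;

// True if every code unit has a non-negative value.
constexpr bool all_non_negative(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (s[i] < 0) return false;
    return true;
}

// True if no two code units have the same value.
constexpr bool all_distinct(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (s[i] == s[j]) return false;
    return true;
}

// 92 characters in, 92 code units out: one code unit per element.
static_assert(kSampleSize == 92, "each element is a single code unit");
static_assert(all_non_negative(kSample, kSampleSize), "non-negative values");
static_assert(all_distinct(kSample, kSampleSize), "distinct code units");

// A character outside the basic literal character set can need more than
// one code unit: U+00E9 takes two code units in UTF-8.
static_assert(sizeof(u8"\u00E9") - 1 == 2, "two UTF-8 code units");
```

The second half of the note does not apply to UTF-8, where code units of multi-unit sequences never coincide with those of basic characters. It matters for encodings such as Shift-JIS, where a trailing byte can have the same value as the backslash.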

The remaining differences from Corentin's P2297R0 are:

 - basic literal character set

There are now four uses of the term in my paper, so it seems to be a useful
descriptive tool. (Suggestions to unify "basic character set" and
"basic literal character set" would imply semantic changes to the status
quo or would use more words, I believe.)

 - translation character set

We agree there is no difference in the (intended) semantics either way;
I believe this is simply a question of presentation in the standard.
My definition (aligned with ISO 10646 terminology) currently reads:

The translation character set consists of the following elements:

 - each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
 - a distinct character for each UCS scalar value where no named character is assigned.
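
Since both bullets above are keyed on UCS scalar values, the translation character set has exactly one element per scalar value. A minimal sketch of that domain follows; the function and constant names are hypothetical, and whether a given value has a named character is not modelled, because that needs the ISO/IEC 10646 character data.

```cpp
#include <cstdint>

// A UCS scalar value is any code point in [0, 0x10FFFF] except the
// surrogate code points [0xD800, 0xDFFF].
constexpr bool is_ucs_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

// One element per scalar value, whether or not a named character is
// assigned: 0x110000 code points minus 0x800 surrogates.
constexpr std::uint32_t kScalarValueCount = 0x110000 - 0x800;
static_assert(kScalarValueCount == 1112064, "number of UCS scalar values");
```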

Jens

