[SG16] Unicode as the basic compiler character set

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 26 Jan 2021 23:29:39 +0100

Hi,

There is a desire to switch the specification of C++
to a "model B" approach as described in the C99 Rationale v5.10,
section 5.2.1.

This paper does that:

https://wiki.edg.com/pub/Wg21telecons2021/SG16/charset.html

(Yes, the intro prose needs more work.)

The new terms introduced here are:

 - translation character set (essentially Unicode)
 - basic character set (used to be "basic source character set")
 - ordinary / wide literal encoding (used to be "execution character set",
   a sub-optimal name, because execution environments may vary for
   a given executable)

Universal-character-names (UCNs) are translated eagerly outside of
literals, but inside literals they are kept as UCNs until phase 7.
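
A minimal sketch of the observable effect, not of the specification
mechanics: outside a literal, two spellings of the same UCN name the same
identifier, while inside a literal the UCN is encoded according to the
literal's kind. The function name ucn_identifier_demo is hypothetical and
chosen only for illustration.

```cpp
#include <cassert>

// Outside a literal: both spellings are UCNs for U+00E4, so they
// designate the same identifier.
inline int ucn_identifier_demo() {
    int \u00E4 = 42;      // declares an identifier consisting of U+00E4
    return \U000000E4;    // same identifier, eight-digit spelling
}

// Inside a literal: the UCN is encoded according to the literal's kind.
static_assert(sizeof(u8"\u00E4") == 3,
              "two UTF-8 code units plus the terminator");
static_assert(sizeof(u"\u00E4") == 2 * sizeof(char16_t),
              "one UTF-16 code unit plus the terminator");
```

The behavior shown should be the same under the status quo wording; the
paper changes how the standard describes it, not what a program observes.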


Regarding the problem that we need to retain hex escape sequences
as code units even when faced with string literal concatenation,
I've simply specified that the lexical structure of a string-literal
is retained.

Example: "\xA" "B"
(even after concatenation) consists of two lexical items:
  hexadecimal-escape-sequence and basic-s-char
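
The following sketch checks this at compile time. The array names split
and merged are hypothetical; the point is only that the hex escape stops
at the boundary of the literal it was written in.

```cpp
#include <cassert>

// "\xA" "B": the escape cannot extend across the literal boundary,
// so the result is the code unit 0x0A followed by 'B'.
constexpr char split[] = "\xA" "B";

// "\xAB": a single escape sequence producing a single code unit.
constexpr char merged[] = "\xAB";

static_assert(sizeof(split) == 3, "0x0A, 'B', terminator");
static_assert(split[0] == '\x0A' && split[1] == 'B',
              "escape and letter stay separate");
static_assert(sizeof(merged) == 2, "0xAB, terminator");
static_assert(static_cast<unsigned char>(merged[0]) == 0xAB,
              "one code unit with value 0xAB");
```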

R"(\u00)" "41"
consists of six lexical items after concatenation:
  r-chars backslash, letter "u", digit "0", digit "0"
  basic-s-chars digit "4" and digit "1"
(no UCN is formed)
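
As a sketch, the raw-string case can be checked the same way. The names
raw_then_ordinary and spells_backslash_u0041 are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// The raw part contributes backslash, 'u', '0', '0' verbatim; the ordinary
// part contributes '4' and '1'. No UCN for U+0041 ('A') is formed.
constexpr char raw_then_ordinary[] = R"(\u00)" "41";

static_assert(sizeof(raw_then_ordinary) == 7,
              "six characters plus the terminator");

// Compares against the same six characters written with an escaped backslash.
inline bool spells_backslash_u0041() {
    return std::strcmp(raw_then_ordinary, "\\u0041") == 0;
}
```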

This keeps string-literals structurally intact until we need
to make objects from them in phase 7.


Regarding the problem that transcoding to the "execution character set"
is undesirable for diagnostic messages, this is already addressed by
the status quo wording; see [lex.string] p9 and p10:

p9 "Evaluating a string-literal results in a string literal object..."

p10 "String literal objects are initialized with the sequence of code unit values..."

We only "evaluate" at runtime (and maybe at constexpr compile-time),
but we don't "evaluate" the string-literals in static_assert or [[nodiscard]],
so we don't get a string literal object for those latter cases, and thus
we don't get any transcoding. Which is good.
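
A small sketch of the distinction, with hypothetical function names. The
messages in static_assert and [[nodiscard]] are only consumed by the
compiler for diagnostics, whereas the returned literal is evaluated and
therefore yields a string literal object. The [[nodiscard]] message form
is guarded because it requires C++20.

```cpp
#include <cassert>

// Diagnostic text only: this string-literal is never evaluated,
// so no string literal object is created for it.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// Same for the attribute argument (message form is C++20).
#if __cplusplus >= 202002L
[[nodiscard("the result reports whether the check passed")]]
#endif
inline bool check_passed() { return true; }

// Evaluated: this literal yields a string literal object whose code
// units use the ordinary literal encoding.
inline const char* evaluated_text() { return "evaluated at run time"; }
```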


There is a bit of cheating to retain 7 translation phases (6 would suffice).

Jens

Received on 2021-01-26 16:29:44