Subject: Unicode as the basic compiler character set
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-01-26 16:29:39


There is a desire to switch the specification of C++
to a "model B" approach as described in the C99 Rationale v5.10,
section 5.2.1.

This paper does that:


(Yes, the intro prose needs more work.)

The new terms introduced here are:

 - translation character set (essentially Unicode)
 - basic character set (formerly "basic source character set")
 - ordinary / wide literal encoding (formerly "execution character set",
   a sub-optimal name, because the execution environment may vary for
   a given executable)

Universal-character-names (UCNs) are translated eagerly outside of
literals, but inside literals they are kept as written until phase 7.
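
As a rough illustration of why a UCN inside a literal cannot simply be
replaced by "a character" early on: the number of code units it produces
depends on the encoding of the literal it appears in. The sketch below
only uses the UTF-8/16/32 literal prefixes, whose encodings are fixed;
the helper ucn_code_units is a hypothetical name made up for this
example and does not come from the paper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t ucn_code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 is a single code point, but its size differs per encoding.
static_assert(ucn_code_units(u8"\u00E9") == 2, "UTF-8: two code units");
static_assert(ucn_code_units(u"\u00E9") == 1, "UTF-16: one code unit");
static_assert(ucn_code_units(U"\u00E9") == 1, "UTF-32: one code unit");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(ucn_code_units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(ucn_code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(ucn_code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

The same UCN spelling yields different code unit sequences, so the
encoding step has to happen when the literal's kind is known.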

Regarding the problem that we need to retain hex escape sequences
as code units even when faced with string literal concatenation,
I've simply specified that the lexical structure of a string-literal
is retained.

Example: "\xA" "B"
(even after concatenation) consists of two lexical items:
  hexadecimal-escape-sequence and basic-s-char

R"(\u00)" "41"
consists of six lexical items after concatenation:
  r-chars backslash, letter "u", digit "0", digit "0"
  basic-s-chars digit "4" and digit "1"
(no UCN is formed)

This keeps string-literals structurally intact until we need
to make objects from them in phase 7.
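
The observable effect of the two examples above can be checked with
static_assert. This is only a sketch of the behavior the wording is
meant to preserve; the variable names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// "\xA" "B": the hex escape ends at the literal boundary, so the result
// has two elements (0x0A and 'B') plus the terminating null.
constexpr char split_escape[] = "\xA" "B";
static_assert(sizeof(split_escape) == 3, "two elements plus null");
static_assert(split_escape[0] == '\x0A', "first element is the hex escape");
static_assert(split_escape[1] == 'B', "second element is the letter B");

// For contrast: written as one literal, \xAB is a single escape sequence.
constexpr char fused_escape[] = "\xAB";
static_assert(sizeof(fused_escape) == 2, "one element plus null");

// R"(\u00)" "41": the raw part contributes backslash, 'u', '0', '0'
// verbatim; no UCN \u0041 is formed across the boundary.
constexpr char raw_then_plain[] = R"(\u00)" "41";
static_assert(sizeof(raw_then_plain) == 7, "six elements plus null");
static_assert(raw_then_plain[0] == '\\' && raw_then_plain[1] == 'u' &&
              raw_then_plain[2] == '0' && raw_then_plain[3] == '0' &&
              raw_then_plain[4] == '4' && raw_then_plain[5] == '1',
              "characters are kept as written");
```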

Regarding the problem that transcoding to the "execution character set"
is undesirable for diagnostic messages, this is already addressed by
the status quo wording; see [lex.string] p9 and p10:

p9 "Evaluating a string-literal results in a string literal object..."

p10 "String literal objects are initialized with the sequence of code unit values..."

We only "evaluate" at runtime (and possibly during constant evaluation
at compile time), but we don't "evaluate" the string-literals in
static_assert or [[nodiscard]], so we don't get a string literal object
for those latter cases, and thus we don't get any transcoding, which is good.
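
A small sketch of the distinction, assuming a C++20 compiler (the
[[nodiscard]] reason string needs C++20). The function names are
hypothetical; the comments restate the reading given above and are not
quoted from the wording.

```cpp
#include <cassert>

// Diagnostic-only uses: these string-literals are never evaluated, so
// under the reading above no string literal object is created for them.
static_assert(sizeof(char) == 1, "sizeof(char) must be 1 (caf\u00E9)");

[[nodiscard("the return value carries the status")]]
inline int compute_status() { return 0; }

// An evaluated use: here the string-literal does produce a string
// literal object, initialized with code units in the ordinary literal
// encoding.
inline const char* greeting() { return "hello"; }
```

Only greeting() involves an object whose contents depend on the literal
encoding; the other two strings exist solely for the compiler's messages.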

There is a bit of cheating to retain 7 translation phases (6 would suffice).


SG16 list run by sg16-owner@lists.isocpp.org