On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
Hi,

There is a desire to switch the specification of C++
to a "model B" approach as described in the C99 Rationale v5.10,
section 5.2.1.

This paper does that:

https://wiki.edg.com/pub/Wg21telecons2021/SG16/charset.html
Thank you, Jens!

(Yes, the intro prose needs more work.)

The new terms introduced here are:

 - translation character set (essentially Unicode)
 - basic character set (used to be "basic source character set")
 - ordinary / wide literal encoding (used to be "execution character set",
which is sub-optimal, because execution environments may vary for
a given executable)
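
To make the "literal encoding" terminology concrete, here is a minimal sketch (not taken from the paper). The helper name code_units is made up for illustration. The UTF-8/16/32 literals have a fixed number of code units, while the ordinary and wide literals depend on whichever ordinary / wide literal encoding the implementation selects, so nothing is asserted about their exact size.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// including the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// U+00E9 needs two UTF-8 code units but one UTF-16 or UTF-32 code unit.
static_assert(code_units(u8"\u00E9") == 3, "UTF-8: 2 code units + null");
static_assert(code_units(u"\u00E9") == 2, "UTF-16: 1 code unit + null");
static_assert(code_units(U"\u00E9") == 2, "UTF-32: 1 code unit + null");

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
static_assert(code_units(u"\U0001F600") == 3, "UTF-16: pair + null");
static_assert(code_units(U"\U0001F600") == 2, "UTF-32: 1 code unit + null");

// For ordinary and wide literals the count depends on the implementation's
// ordinary / wide literal encoding, so no fixed value is asserted here.
constexpr std::size_t ordinary_units = code_units("\u00E9");
constexpr std::size_t wide_units = code_units(L"\u00E9");
```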

Universal-character-names (UCNs) are translated eagerly outside of literals,
but inside literals they are kept as written until phase 7.
I am not sure we discussed stringization behaviour of tokens under this model. We no longer have the "which UCN shows up" problem. We have the "should the UCN stick around" problem.
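
A small probe for the stringization question might look like the sketch below. STRINGIZE is a hypothetical helper macro, and the resulting sizes are deliberately not asserted to be any particular value: whether the UCN survives as a backslash spelling or shows up as an encoded character is exactly the open question, and current implementations may differ.

```cpp
#include <cstddef>

#define STRINGIZE(x) #x  // hypothetical helper macro

// Stringize a string literal whose spelling contains a UCN.
constexpr char from_literal[] = STRINGIZE("\u00E9");

// Stringize an identifier spelled with a UCN.
constexpr char from_identifier[] = STRINGIZE(\u00E9);

// If the UCN "sticks around", the array holds its backslash spelling;
// otherwise it holds the character in the ordinary literal encoding.
// Comparing the sizes on a given compiler shows which one happened.
constexpr std::size_t literal_size = sizeof(from_literal);
constexpr std::size_t identifier_size = sizeof(from_identifier);
```

Printing from_literal and from_identifier at runtime on the compilers of interest would show the actual spelling each one produces.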

Regarding the problem that we need to retain hex escape sequences
as code units even when faced with string literal concatenation,
I've simply specified that the lexical structure of a string-literal
is retained.

Example:   "\xA" "B"
(even after concatenation) consists of two lexical items:
  hexadecimal-escape-sequence  and  basic-s-char

Example:   R"(\u00)" "41"
(even after concatenation) consists of six lexical items:
  four r-chars: backslash, letter "u", digit "0", digit "0"
  two basic-s-chars: digit "4", digit "1"
(no UCN is formed)

This keeps string-literals structurally intact until we need
to make objects from them in phase 7.
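
The two examples above can be checked with a few static_asserts. This is only a sketch of the observable result, not of the proposed wording; the helper name length is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: length of a string literal in code units,
// excluding the terminating null.
template <std::size_t N>
constexpr std::size_t length(const char (&)[N]) { return N - 1; }

// First example: the hexadecimal-escape-sequence ends at the literal
// boundary, so the result is the code unit 0x0A followed by the letter B,
// not the single code unit 0xAB.
constexpr char hex_then_b[] = "\xA" "B";
static_assert(length(hex_then_b) == 2, "two code units");
static_assert(hex_then_b[0] == 0x0A, "first code unit is 0x0A");
static_assert(hex_then_b[1] == 'B', "second code unit is the letter B");

// For comparison, the same characters in one literal form one code unit.
static_assert(length("\xAB") == 1, "one code unit");

// Second example: the raw literal contributes backslash, u, 0, 0 and the
// second literal contributes 4 and 1. No universal-character-name is
// formed, so the result has six characters rather than a single 'A'.
constexpr char raw_then_digits[] = R"(\u00)" "41";
static_assert(length(raw_then_digits) == 6, "six characters");
static_assert(raw_then_digits[0] == '\\', "backslash");
static_assert(raw_then_digits[1] == 'u', "letter u");
static_assert(raw_then_digits[4] == '4', "digit 4");
static_assert(raw_then_digits[5] == '1', "digit 1");
```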


Regarding the problem that transcoding to the "execution character set"
is undesirable for diagnostic messages, this is already addressed by
the status quo wording; see [lex.string] p9 and p10:

p9 "Evaluating a string-literal results in a string literal object..."

p10 "String literal objects are initialized with the sequence of code unit values..."

We only "evaluate" a string-literal at runtime (and possibly during
constant evaluation at compile time), but we don't "evaluate" the
string-literals in static_assert or [[nodiscard]]. So we don't get a
string literal object for those latter cases, and thus we don't get any
transcoding, which is good.
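
As a rough illustration of the distinction (assuming C++20 for the [[nodiscard]] reason string; the function names are made up):

```cpp
// The message of static_assert is only used in a diagnostic; it is never
// evaluated, so no string literal object is created for it.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// Likewise for the reason string of [[nodiscard]] (C++20).
[[nodiscard("the result reports whether the value is acceptable")]]
constexpr bool value_ok(int value) { return value >= 0; }

// By contrast, this literal is evaluated: a string literal object is
// initialized with code units in the ordinary literal encoding.
constexpr const char* greeting() { return "hello"; }
```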


There is a bit of cheating to retain 7 translation phases (6 would suffice).

Jens