sg16: Re: [SG16] Unicode as the basic compiler character set

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 26 Jan 2021 22:53:28 -0500

On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> Hi,
>
> There is a desire to switch the specification of C++
> to a "model B" approach as described in the C99 Rationale v5.10,
> section 5.2.1.
>
> This paper does that:
>
> https://wiki.edg.com/pub/Wg21telecons2021/SG16/charset.html

Thank you, Jens!

>
>
> (Yes, the intro prose needs more work.)
>
> The new terms introduced here are:
>
> - translation character set (essentially Unicode)
> - basic character set (used to be "basic source character set")
> - ordinary / wide literal encoding (used to be "execution character set",
> which is sub-optimal, because execution environments may vary for
> a given executable)
>
> UCNs are translated eagerly outside of literals, but are kept
> until phase 7 for literals.
>
I am not sure we discussed stringization behaviour of tokens under this
model. We no longer have the "which UCN shows up" problem. We have the
"should the UCN stick around" problem.
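
For concreteness, here is a minimal C++ sketch of the stringization case in question. The macro `STR`, the identifier `caf\u00E9`, and the function names are illustrative assumptions, not taken from the paper. Whether the stringized result keeps the UCN spelling or holds the character it names differs between implementations today, so the sketch checks only what both outcomes share.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper macro, for illustration only.
#define STR(x) #x

// Stringize a token that is spelled with a UCN in the source.
// Depending on the implementation, the resulting literal may hold the
// UCN spelling (backslash, 'u', hex digits) or the encoded character U+00E9.
inline const char* stringized_ucn() {
    return STR(caf\u00E9);
}

// Both outcomes begin with the three basic characters "caf".
inline bool starts_with_caf(const char* s) {
    assert(s != nullptr);
    return std::strncmp(s, "caf", 3) == 0;
}
```

Printing `stringized_ucn()` on a given compiler shows which of the two behaviours that implementation currently picks.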

>
>
> Regarding the problem that we need to retain hex escape sequences
> as code units even when faced with string literal concatenation,
> I've simply specified that the lexical structure of a string-literal
> is retained.
>
> Example: "\xA" "B"
> (even after concatenation) consists of two lexical items:
> hexadecimal-escape-sequence and basic-s-char
>
> R"(\u00)" "41"
> consists of six lexical items after concatenation:
> r-chars backslash, letter "u", digit "0", digit "0"
> basic-s-chars digit "4" and digit "1"
> (no UCN is formed)
>
> This keeps string-literals structurally intact until we need
> to make objects from them in phase 7.
>
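
The two quoted examples can be checked against existing behaviour with a short C++ sketch. The variable and function names are illustrative assumptions. The sketch shows only the observable result (which characters end up in the array), not the wording mechanism proposed in the paper.

```cpp
#include <cassert>
#include <cstring>

// "\xA" "B": the hex escape ends at the literal boundary, so the
// concatenated literal holds two characters, 0x0A and 'B'.
constexpr char split_hex[] = "\xA" "B";
static_assert(sizeof(split_hex) == 3, "0x0A, 'B', and the terminating null");
static_assert(split_hex[0] == '\x0A' && split_hex[1] == 'B', "two separate items");

// For contrast, a single literal where the escape consumes both digits.
constexpr char joined_hex[] = "\xAB";
static_assert(sizeof(joined_hex) == 2, "one code unit plus the terminating null");

// R"(\u00)" "41": the raw part contributes backslash, 'u', '0', '0' as
// plain characters, so appending "41" does not form the UCN for 'A'.
constexpr char raw_then_digits[] = R"(\u00)" "41";
static_assert(sizeof(raw_then_digits) == 7, "six characters plus the terminating null");

// Runtime view of the same facts; the function name is illustrative.
inline bool concatenation_keeps_structure() {
    assert(std::strlen(split_hex) == 2);
    return std::strcmp(raw_then_digits, "\\u0041") == 0;
}
```
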
>
> Regarding the problem that transcoding to the "execution character set"
> is undesirable for diagnostic messages, this is already addressed by
> the status quo wording; see [lex.string] p9 and p10:
>
> p9 "Evaluating a string-literal results in a string literal object..."
>
> p10 "String literal objects are initialized with the sequence of code unit
> values..."
>
> We only "evaluate" at runtime (and maybe at constexpr compile-time),
> but we don't "evaluate" the string-literals in static_assert or
> [[nodiscard]], so we don't get a string literal object for those latter
> cases, and thus we don't get any transcoding. Which is good.
>
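
A minimal C++ sketch of the distinction drawn above, assuming nothing beyond standard C++11; the function name is illustrative. The same text appears once as a static_assert message, where it is never evaluated, and once as an evaluated literal that yields a string literal object. The string argument of [[nodiscard]] (C++20) is in the same position as the static_assert message and is left out only to keep the sketch portable to earlier standards.

```cpp
#include <cassert>
#include <cstring>

// Unevaluated use: the message is only ever shown by the compiler in a
// diagnostic; no string literal object is created for it.
static_assert(sizeof(char) == 1, "char must be one byte");

// Evaluated use: this literal initializes an object in the program, so its
// contents are code units in the ordinary literal encoding.
// The function name is illustrative only.
inline const char* evaluated_message() {
    const char* msg = "char must be one byte";
    assert(msg[0] == 'c');
    return msg;
}
```
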
>
> There is a bit of cheating to retain 7 translation phases (6 would suffice).
>
> Jens

Received on 2021-01-26 21:54:02