SG16

Subject: Re: Unicode as the basic compiler character set
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-01-27 02:20:30


On 27/01/2021 04.53, Hubert Tong wrote:
> On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

> UCNs are translated eagerly outside of literals, but are kept
> until phase 7 for literals.
>
> I am not sure we discussed stringization behaviour of tokens under this model. We no longer have the "which UCN shows up" problem. We have the "should the UCN stick around" problem.

Reading [cpp.stringize], it seems the expected stringization
of the source-code token

  name\u4242

would be

  "name\\u4242"

and that would be the same if instead the actual Unicode character
for U+4242 appeared in the source code under the C++20 rules
(because of early replacement of U+4242 with ASCII-only \u4242).

With this paper, the behavior will definitely change in that
the latter situation will yield "name<U+4242>".
With the status quo of the paper, even the former case will turn
into that string, but we can prevent that by keeping UCNs outside
string-literals a little longer (until phase 5).
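
To make the scenario concrete, here is a minimal sketch of the stringization in question. The macro name STRINGIZE and both helper functions are hypothetical, chosen only for illustration. The exact contents of the resulting string depend on the implementation and on which of the models discussed here it follows, so the sketch only checks the part that is the same in every case, namely the leading "name".

```cpp
#include <cassert>
#include <string>

// Single-level stringizing macro: the argument is stringized as spelled,
// without prior macro expansion.
#define STRINGIZE(x) #x

// Hypothetical helper: returns whatever this implementation produces when
// the identifier name\u4242 is stringized. Depending on the model, the tail
// may be the encoded character U+4242 or a spelled-out escape.
inline std::string stringized_ucn_identifier() {
    return STRINGIZE(name\u4242);
}

// Hypothetical helper: true if the string begins with the four letters "name".
inline bool starts_with_name(const std::string& s) {
    return s.compare(0, 4, "name") == 0;
}
```

Printing the result of stringized_ucn_identifier() on a given compiler shows which of the outcomes above that implementation actually picks.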

The whole scenario feels like it violates the principle that the
chosen model (as per the C rationale) shouldn't matter.

Alternatively, we could replace all UCNs eagerly (forming UCNs during
## token pasting is "undefined behavior" anyway). However, that would
cause

"\xA\u0041"

to become

"\xAA"

which differs from the status quo.
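
The difference can be checked directly. The following is a sketch under the assumption of an ASCII-compatible ordinary literal encoding (so that U+0041 becomes the single code unit 0x41); the names kept, eager and code_unit are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Status quo: the hexadecimal escape \xA ends at the following backslash,
// so the literal holds the code units 0x0A, 'A' and the terminating NUL.
constexpr char kept[] = "\xA\u0041";

// What eager replacement of \u0041 by A would leave behind: a single
// hexadecimal escape \xAA, i.e. one code unit plus the terminating NUL.
constexpr char eager[] = "\xAA";

static_assert(sizeof(kept) == 3, "two code units plus NUL");
static_assert(sizeof(eager) == 2, "one code unit plus NUL");

// Hypothetical helper: reads a code unit as an unsigned value so the
// comparison does not depend on the signedness of char.
inline unsigned code_unit(const char* s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}
```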

Jens

> Regarding the problem that we need to retain hex escape sequences
> as code units even when faced with string literal concatenation,
> I've simply specified that the lexical structure of a string-literal
> is retained.
>
> Example:   "\xA" "B"
> (even after concatenation) consists of two lexical items:
>   hexadecimal-escape-sequence  and  basic-s-char
>
> R"(\u00)" "41"
> consists of six lexical items after concatenation:
>   r-chars backslash, letter "u", digit "0", digit "0"
>   basic-s-chars digit "4" and digit "1"
> (no UCN is formed)
>
> This keeps string-literals structurally intact until we need
> to make objects from them in phase 7.
>
>
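
The two concatenation examples quoted above can be sketched as compile-time checks. This assumes an ASCII-compatible ordinary literal encoding; the variable and function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// "\xA" "B": the hexadecimal escape does not absorb the B from the second
// literal, so the result is 0x0A, 'B', NUL rather than the single unit 0xAB.
constexpr char hex_then_b[] = "\xA" "B";

// R"(\u00)" "41": the raw part contributes the four characters \ u 0 0 and
// the second literal contributes 4 and 1. No universal-character-name forms.
constexpr char raw_then_digits[] = R"(\u00)" "41";

static_assert(sizeof(hex_then_b) == 3, "two code units plus NUL");
static_assert(sizeof(raw_then_digits) == 7, "six characters plus NUL");

// Hypothetical helper: true if the concatenated raw literal is the literal
// six-character text \u0041 and not the single letter A.
inline bool raw_is_not_ucn() {
    return std::strcmp(raw_then_digits, "\\u0041") == 0;
}
```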
> Regarding the problem that transcoding to the "execution character set"
> is undesirable for diagnostic messages, this is already addressed by
> the status quo wording; see [lex.string] p9 and p10:
>
> p9 "Evaluating a string-literal results in a string literal object..."
>
> p10 "String literal objects are initialized with the sequence of code unit values..."
>
> We only "evaluate" at runtime (and maybe at constexpr compile-time),
> but we don't "evaluate" the string-literals in static_assert or [[nodiscard]],
> so we don't get a string literal object for those latter cases, and thus
> we don't get any transcoding.  Which is good.
>
>
> There is a bit of cheating to retain 7 translation phases (6 would suffice).
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>