sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 5 May 2020 08:15:41 +0200

On 05/05/2020 07.58, Tom Honermann via SG16 wrote:
> P1949R3 <https://wg21.link/p1949r3> presents the following code that, assuming I accurately captured the discussion during the April 22nd SG16 telecon in https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020, we intend to make well-formed (it is currently ill-formed because \u0300 doesn't match /identifier/ and is therefore lexed as '\' followed by 'u0300').
>
>> #define accent(x)x##\u0300
>> constexpr int accent(A) = 2;
>> constexpr int gv2 = A\u0300;
>> static_assert(gv2 == 2, "whatever");

(Did I mention I hate HTML e-mails?)

The proposed wording does not attempt to make this example well-formed,
assuming that a combining character is not in XID_Continue.
(Please check me on the latter.)

When we preprocess accent(A),
we perform A ## \u0300
which becomes A\u0300
which is not a (single) preprocessing token
(because \u0300 is not in XID_Continue, so this is not an identifier,
and none of the other kinds in [lex.pptoken] matches)
and we get undefined behavior per [cpp.concat] p3.
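
The steps above can be sketched as a small model. This is only an
illustration, not the proposed wording: paste, is_single_identifier and
is_ascii_start are hypothetical helpers, identifier-start is approximated
with ASCII letters and '_', and whether U+0300 is in XID_Continue is passed
in as a parameter because that is exactly the assumption still to be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical model: a token spelling is a sequence of code points.
using Spelling = std::u32string;
using Predicate = std::function<bool(char32_t)>;

// ASCII-only stand-in for "may start an identifier" (assumption, not UAX #31).
inline bool is_ascii_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_';
}

// Models the ## operator: the two spellings are simply placed side by side.
inline Spelling paste(const Spelling& lhs, const Spelling& rhs) {
    return lhs + rhs;
}

// True if the whole spelling matches one identifier. Non-ASCII continuation
// characters are accepted only if the caller-supplied predicate says so.
inline bool is_single_identifier(const Spelling& s, const Predicate& in_xid_continue) {
    if (s.empty() || !is_ascii_start(s[0])) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        const char32_t c = s[i];
        const bool ok = is_ascii_start(c) || (c >= U'0' && c <= U'9') || in_xid_continue(c);
        if (!ok) return false;
    }
    return true;
}

// Walks through accent(A): A ## \u0300 under both possible answers.
inline void check_email_example() {
    const Spelling pasted = paste(U"A", Spelling(1, char32_t{0x0300}));

    // If U+0300 is not in XID_Continue, the result is not one identifier,
    // which is the [cpp.concat] p3 case described above.
    assert(!is_single_identifier(pasted, [](char32_t) { return false; }));

    // If U+0300 is in XID_Continue, the result is a single identifier.
    assert(is_single_identifier(pasted, [](char32_t c) { return c == 0x0300; }));
}
```

Calling check_email_example() exercises both branches; which one applies
depends only on the XID_Continue question.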

We decided not to address the undefined behavior case here,
because that's SG12 territory.

Jens

> However, the proposed wording would reject the following case involving multiple combining characters:
>
>> #define accent(x)x##\u0300\u0327
>> constexpr int accent(A) = 2;
>> constexpr int gv2 = A\u0300\u0327;
>> static_assert(gv2 == 2, "whatever");
>
> The rejection occurs because the proposed wording <http://wiki.edg.com/pub/Wg21summer2020/SG16/uax31.html> results in each /universal-character-name/ that is not lexed as part of one of the existing /preprocessing-token/ cases being lexed as its own preprocessing token; the attempted concatenation produces two preprocessing tokens (A\u0300 and \u0327). I don't know of a principled reason for such rejection, though it isn't clear to me which characters should be permitted to be munched together. One approach would be to introduce another new /preprocessing-token/ category to match the proposed /identifier-continue/; max munch would still always prefer /identifier/ when such a sequence is preceded by a character in XID_Start. We would still want to retain the proposed new "each /universal-character-name/ ..." category as a way to avoid tearing of /universal-character-name/s that name a character not in XID_Start or XID_Continue.
>
> I'm not convinced that this scenario is worth addressing. It strikes me as approximately as valuable as the first example.
>
> Tom.
>
>

Received on 2020-05-05 01:18:45