sg16: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 5 May 2020 01:58:35 -0400

P1949R3 <https://wg21.link/p1949r3> presents the following code that,
assuming I accurately captured the discussion during the April 22nd SG16
telecon in
https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020, we intend
to make well-formed (it is currently ill-formed because \u0300 doesn't
match /identifier/ and is therefore lexed as '\' followed by 'u0300').

> #define accent(x)x##\u0300
> constexpr int accent(A) = 2;
> constexpr int gv2 = A\u0300;
> static_assert(gv2 == 2, "whatever");

However, the proposed wording would reject the following case involving
multiple combining characters:

> #define accent(x)x##\u0300\u0327
> constexpr int accent(A) = 2;
> constexpr int gv2 = A\u0300\u0327;
> static_assert(gv2 == 2, "whatever");

The rejection occurs because the proposed wording
<http://wiki.edg.com/pub/Wg21summer2020/SG16/uax31.html> results in each
/universal-character-name/ that is not lexed as part of one of the
existing /preprocessing-token/ cases being lexed as its own
preprocessing token. The ## operator therefore pastes only A and \u0300,
and the expansion yields two preprocessing tokens (A\u0300 and \u0327)
rather than one identifier. I don't know of a principled
reason for such rejection, though it isn't clear to me what characters
should be permitted to be munched together. One approach would be to
introduce another new /preprocessing-token/ category to match the
proposed /identifier-continue/; max munch would still always prefer
/identifier/ when such a sequence is preceded by a character in
XID_Start. We would still want to retain the proposed new "each
/universal-character-name/ ..." category as a way to avoid tearing of
/universal-character-name/s that name a character not in XID_Start or
XID_Continue.
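
To make the difference concrete, here is a rough sketch of the two lexing
behaviours as a toy max-munch lexer. It is not the proposed wording and not
how any compiler is implemented. The names lex, expand_accent, and the
continue_category flag are hypothetical, the character classification is
hard-coded for just the characters in the examples above, and I am assuming
the extra category would match a run of /identifier-continue/ characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy classification (assumption): ASCII letters and '_' stand in for
// XID_Start. Digits are left out so that pp-number need not be modelled.
bool is_start(const std::string& c) {
    return c.size() == 1 &&
           (std::isalpha(static_cast<unsigned char>(c[0])) || c[0] == '_');
}

// U+0300 and U+0327 are combining marks: XID_Continue but not XID_Start.
bool is_continue(const std::string& c) {
    return is_start(c) || c == "\\u0300" || c == "\\u0327";
}

// Split into "characters": a six-character \uXXXX spelling or a single char.
std::vector<std::string> split_chars(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        bool ucn = s.compare(i, 2, "\\u") == 0 && i + 6 <= s.size();
        std::size_t n = ucn ? 6 : 1;
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}

// Max-munch lexer. With continue_category == false, a UCN that does not
// start an identifier becomes its own token (the "each
// universal-character-name" rule). With true, a run of identifier-continue
// characters is munched into one token (the hypothetical extra category).
std::vector<std::string> lex(const std::string& src, bool continue_category) {
    std::vector<std::string> chars = split_chars(src);
    std::vector<std::string> tokens;
    for (std::size_t i = 0; i < chars.size();) {
        std::string tok = chars[i++];
        if (tok == " ") continue;  // skip whitespace
        bool ident = is_start(tok);
        bool run = continue_category && !ident && is_continue(tok);
        if (ident || run) {
            while (i < chars.size() && is_continue(chars[i])) tok += chars[i++];
        }
        tokens.push_back(tok);
    }
    return tokens;
}

// Models "#define accent(x)x##<suffix>": ## pastes the argument with the
// first token of the suffix only; later suffix tokens stay separate.
std::vector<std::string> expand_accent(const std::string& arg,
                                       const std::string& suffix,
                                       bool continue_category) {
    std::vector<std::string> rest = lex(suffix, continue_category);
    assert(!rest.empty());
    std::vector<std::string> out = lex(arg + rest[0], continue_category);
    out.insert(out.end(), rest.begin() + 1, rest.end());
    return out;
}
```

In this sketch, expand_accent("A", "\\u0300\\u0327", false) gives the two
tokens A\u0300 and \u0327, while passing true gives the single token
A\u0300\u0327, matching what lex produces for the identifier written
directly in the source.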

I'm not convinced that this scenario is worth addressing. It strikes me
as approximately as valuable as the first example.

Tom.

Received on 2020-05-05 01:01:36