sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 5 May 2020 21:46:30 +0200

On 05/05/2020 21.17, Tom Honermann wrote:
> On 5/5/20 2:36 AM, Jens Maurer via SG16 wrote:
>> On 05/05/2020 08.34, Jens Maurer via SG16 wrote:
>>> On 05/05/2020 08.15, Jens Maurer via SG16 wrote:
>>>> On 05/05/2020 07.58, Tom Honermann via SG16 wrote:
>>>>> P1949R3 <https://wg21.link/p1949r3> presents the following code that, assuming I accurately captured the discussion during the April 22nd SG16 telecon in https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020, we intend to make well-formed (it is currently ill-formed because \u0300 doesn't match /identifier/ and is therefore lexed as '\' followed by 'u0300').
>>>>>
>>>>>> |#define accent(x)x##\u0300 constexpr int accent(A) = 2; constexpr int gv2 = A\u0300; static_assert(gv2 == 2, "whatever");|
>>>> (Did I mention I hate HTML e-mails?)
> Especially when presentation like that affects the semantics of the text :(

>>>> The proposed wording does not attempt to make this example well-formed,
>>>> assuming that a combining character is not in XID_Continue.
>>>> (Please check me on the latter.)
>>> That's wrong. UAX#31 says XID_Continue is ID_Continue with
>>> a few (enumerated) exceptions.
>>> https://www.unicode.org/reports/tr31/tr31-33.html#NFKC_Modifications

> That section discusses NFKC, not NFC. I don't think it is applicable to our intents.

Right, sorry. Try this section:

https://www.unicode.org/reports/tr31/tr31-33.html#Default_Identifier_Syntax

>> ... and ID_Continue includes combining characters.
> It would be rather restrictive if it didn't :)

Yeah.

>>>> When we preprocess accent(A),
>>>> we perform A ## \u0300
>>>> which becomes A\u0300
>>>> which is not a (single) preprocessing token
>>>> (because \u0300 is not in XID_Continue, so this is not an identifier,
>>>> and none of the other kinds in [lex.pptoken] matches)
>>> So, this argument must become: "which is lexed as
>>> an /identifier/ preprocessing token, and then immediately
>>> rejected" because
>>> "The program is ill-formed if an identifier does not conform
>>> to the NFC normalization specified in ISO/IEC 10646."
>
> Agreed for that example. But for the other example I provided, the resulting identifier (if lexed such that \u0300\u0327 produces a single preprocessor token) is in NFC since there is no precomposed character for a capital letter A with grave and cedilla. Do we believe that that example should be well-formed?

No.

> #define accent(x) x##\u0300\u0327
>
> constexpr int accent(A) = 2;
> constexpr int gv2 = A\u0300\u0327;
> static_assert(gv2 == 2, "whatever");

Since \u0300 is in XID_Continue, but not in XID_Start,

x##\u0300\u0327 is actually

x <token separator> ## <token separator> \u0300 <token separator> \u0327

because we can lex \u0300 only as a degenerate preprocessing
token ("universal-character-name that is none of the above").
In particular, \u0300\u0327 is not a single preprocessing token.
(It's certainly not an /identifier/, and what else would it be?)
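
To make the lexing argument concrete, here is a minimal sketch of the rule as I read it: an identifier must begin with a start character, and any other character that cannot begin an identifier becomes a token of its own. The names is_start, is_continue and lex are made up for illustration, the input is assumed to have its universal-character-names already decoded to code points, and the two predicates are tiny hard-coded stand-ins for XID_Start and XID_Continue that cover only the characters in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for "nondigit or XID_Start"; ASCII letters and '_' only.
bool is_start(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') || c == U'_';
}

// Hypothetical stand-in for XID_Continue: the start set, ASCII digits, and
// the combining diacritical marks block U+0300..U+036F.
bool is_continue(char32_t c) {
    return is_start(c) || (c >= U'0' && c <= U'9') ||
           (c >= 0x0300 && c <= 0x036F);
}

// Split already-decoded code points into tokens. A start character followed
// by continue characters forms one identifier (max munch); every other
// non-space character becomes a single-character token.
std::vector<std::u32string> lex(const std::u32string& in) {
    std::vector<std::u32string> tokens;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == U' ') {  // skip token separators
            ++i;
            continue;
        }
        const std::size_t begin = i++;
        if (is_start(in[begin])) {
            while (i < in.size() && is_continue(in[i])) {
                ++i;
            }
        }
        tokens.push_back(in.substr(begin, i - begin));
    }
    return tokens;
}
```

With these stand-in tables, lex(U"\u0300\u0327") yields two tokens, while lex(U"A\u0300\u0327") yields a single three-code-point identifier. The sketch ignores every other kind of preprocessing token and does not perform the NFC check.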

>>> And since UAX#31 actually specifies XID_Continue and
>>> ID_Continue, not UAX#44, we need to make our reference
>>> to UAX#31 normative and the reference to UAX#44
>>> informative (bibliography).
>>>
>>> Also, the normative text should refer to UAX#31, not UAX#44.
>
> I don't think that is correct. I believe UAX#44 does define the XID_Start and XID_Continue properties; UAX#31 provides some informational context for why they are defined as they are.

I thought so too, but then I started reading in detail.

https://www.unicode.org/reports/tr31/tr31-33.html#Default_Identifier_Syntax

has a table 2 which defines ID_Start and ID_Continue,
as well as
  XID_Start: XID_Start characters are derived from ID_Start as per Section 5.1, NFKC Modifications.
  XID_Continue: XID_Continue characters are derived from ID_Continue as per Section 5.1, NFKC Modifications.

In contrast, UAX#44 says

https://www.unicode.org/reports/tr44/tr44-26.html#DerivedCoreProperties.txt
"Used to determine programming identifiers, as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]."

which is a bit short on definition.

The actual list of character ranges is in
https://www.unicode.org/Public/13.0.0/ucd/DerivedCoreProperties.txt
so maybe we should refer to that file instead.
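
For what it's worth, that file is easy to consume mechanically. Below is a rough sketch of reading one data line of the form "first..last ; Property # comment" (or a single code point instead of a range). PropertyRange, trim and parse_line are names invented here, and the sketch assumes exactly one property field per line, which matches the 13.0.0 file as far as I can tell.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical record for one data line of DerivedCoreProperties.txt.
struct PropertyRange {
    std::uint32_t first;
    std::uint32_t last;
    std::string property;
};

// Remove leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parse "0300..036F ; XID_Continue # ..." or "00AA ; XID_Start # ...".
// Returns nullopt for blank lines, comment-only lines, and malformed input.
std::optional<PropertyRange> parse_line(const std::string& line) {
    // Drop the trailing comment first; it may itself contain "..".
    const std::string data = line.substr(0, line.find('#'));
    const std::size_t semi = data.find(';');
    if (semi == std::string::npos) return std::nullopt;

    const std::string range = trim(data.substr(0, semi));
    const std::string property = trim(data.substr(semi + 1));
    if (range.empty() || property.empty()) return std::nullopt;

    try {
        const std::size_t dots = range.find("..");
        const auto first = static_cast<std::uint32_t>(
            std::stoul(range.substr(0, dots), nullptr, 16));
        const auto last = dots == std::string::npos
            ? first
            : static_cast<std::uint32_t>(
                  std::stoul(range.substr(dots + 2), nullptr, 16));
        return PropertyRange{first, last, property};
    } catch (const std::exception&) {
        return std::nullopt;  // non-hexadecimal or out-of-range field
    }
}
```

Collecting every line whose property is XID_Start or XID_Continue would give the range tables that an implementation needs; the sketch does no validation beyond what std::stoul provides.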

Jens

> Tom.
>
>>> Jens
>>>
>>>
>>>
>>>> and we get undefined behavior per [cpp.concat] p3.
>>>>
>>>> We decided not to address the undefined behavior case here,
>>>> because that's SG12 territory.
>>>>
>>>> Jens
>>>>
>>>>
>>>>> However, the proposed wording would reject the following case involving multiple combining characters:
>>>>>
>>>>>> |#define accent(x)x##\u0300\u0327 constexpr int accent(A) = 2; constexpr int gv2 = A\u0300\u0327; static_assert(gv2 == 2, "whatever");|
>>>>> The rejection occurs because the proposed wording <http://wiki.edg.com/pub/Wg21summer2020/SG16/uax31.html> results in each /universal-character-name/ that is not lexed as part of one of the existing /preprocessing-token/ cases being lexed as its own preprocessing token; the attempted concatenation produces two preprocessor tokens (A\u0300 and \u0327). I don't know of a principled reason for such rejection, though it isn't clear to me what characters should be permitted to be munched together. One approach would be to introduce another new /preprocessing-token/ category to match the proposed /identifier-continue/; max munch would still always prefer /identifier/ when such a sequence is preceded by a character in XID_Start. We would still want to retain the proposed new "each /universal-character-name/ ..." category as a way to avoid tearing of /universal-character-name/s that name a character not in XID_Start or XID_Continue.
>>>>>
>>>>> I'm not convinced that this scenario is worth addressing. It strikes me as approximately as valuable as the first example.
>>>>>
>>>>> Tom.
>>>>>
>>>>>
>

Received on 2020-05-05 14:49:37