sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 5 May 2020 17:45:58 -0400

On 5/5/20 3:46 PM, Jens Maurer via SG16 wrote:
> On 05/05/2020 21.17, Tom Honermann wrote:
>> On 5/5/20 2:36 AM, Jens Maurer via SG16 wrote:
>>> On 05/05/2020 08.34, Jens Maurer via SG16 wrote:
>>>> On 05/05/2020 08.15, Jens Maurer via SG16 wrote:
>>>>> On 05/05/2020 07.58, Tom Honermann via SG16 wrote:
>>>>>> P1949R3 <https://wg21.link/p1949r3> presents the following code that, assuming I accurately captured the discussion during the April 22nd SG16 telecon in https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020, we intend to make well-formed (it is currently ill-formed because \u0300 doesn't match /identifier/ and is therefore lexed as '\' followed by 'u0300').
>>>>>>
>>>>>>> |#define accent(x)x##\u0300 constexpr int accent(A) = 2; constexpr int gv2 = A\u0300; static_assert(gv2 == 2, "whatever");|
>>>>> (Did I mention I hate HTML e-mails?)
>> Especially when presentation like that affects the semantics of the text :(
>>>>> The proposed wording does not attempt to make this example well-formed,
>>>>> assuming that a combining character is not in XID_Continue.
>>>>> (Please check me on the latter.)
>>>> That's wrong. UAX#31 says XID_Continue is ID_Continue with
>>>> a few (enumerated) exceptions.
>>>> https://www.unicode.org/reports/tr31/tr31-33.html#NFKC_Modifications
>> That section discusses NFKC, not NFC. I don't think it is applicable to our intents.
> Right, sorry. Try this section:
>
> https://www.unicode.org/reports/tr31/tr31-33.html#Default_Identifier_Syntax
>
>>> ... and ID_Continue includes combining characters.
>> It would be rather restrictive if it didn't :)
> Yeah.
>
>>>>> When we preprocess accent(A),
>>>>> we perform A ## \u0300
>>>>> which becomes A\u0300
>>>>> which is not a (single) preprocessing token
>>>>> (because \u0300 is not in XID_Continue, so this is not an identifier,
>>>>> and none of the other kinds in [lex.pptoken] matches)
>>>> So, this argument must become "which is lexed as
>>>> an /identifier/ preprocessing token, and then immediately
>>>> rejected because
>>>> "The program is ill-formed if an identifier does not conform
>>>> to the NFC normalization specified in ISO/IEC 10646."
>> Agreed for that example. But for the other example I provided, the resulting identifier (if lexed such that \u0300\u0327 produces a single preprocessor token) is in NFC since there is no precomposed character for a capital letter A with grave and cedilla. Do we believe that that example should be well-formed?
> No.
>
>> #define accent(x) x##\u0300\u0327
>>
>> constexpr int accent(A) = 2;
>> constexpr int gv2 = A\u0300\u0327;
>> static_assert(gv2 == 2, "whatever");
> Since \u0300 is in XID_Continue, but not in XID_Start,
>
> x##\u0300\u0327 is actually
>
> x <token separator> ## <token separator> \u0300 <token separator> \u0327
>
> because we can lex \u0300 only as a degenerate preprocessing
> token ("universal-character-name that is none of the above")
> In particular, \u0300\u0327 is not a single preprocessing token.
> (It's certainly not an /identifier/, and what else would it be?)

I phrased that question poorly. I meant to ask if we *want* that
example to be well-formed (after correcting it for Hubert's observation
that the constructed identifier is not in NFC as I initially claimed).

Let's adjust it to:

> #define accent(x) x##\u0327\u0300
>
> constexpr int accent(Z) = 2;
> constexpr int gv2 = Z\u0327\u0300;
> static_assert(gv2 == 2, "whatever");

According to https://minaret.info/test/normalize.msp (thanks for that
link, Hubert), the constructed identifier (again, assuming that
\u0327\u0300 were to be lexed as a single preprocessing token) is in NFC.

I'm happy with the degenerate "universal-character-name that is none of
the above" approach; I'm just wondering whether it can or should be
extended to munch multiple such characters. If it can be so extended, do
we have a good rationale for why we wouldn't have it do so? If it can't
be so extended, what is the technical reason?

Rather than just extending "universal-character-name that is none of the
above" to munch multiple characters, another approach would be to keep
that category (so as to avoid tearing of UCNs) and to add an additional
non-terminal that matches /identifier-continue/ (but avoids ambiguity
with /identifier/). I believe this would suffice to enable construction,
via concatenation, of every valid identifier at arbitrary UCN boundaries.
I withhold opinion on whether that is a useful design consideration.
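
To make the two options concrete, here is a rough toy model of the lexing
behavior under discussion. It is only a sketch, not the proposed wording:
toy_xid_start, toy_xid_continue, toy_lex, and toy_lex_demo are hypothetical
names, the property tables are cut down to ASCII letters plus the two
combining marks used above, and the input is assumed to be already decoded
to code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the Unicode properties (assumption: only ASCII letters
// and U+0300 / U+0327 are modeled; the real sets are listed in
// DerivedCoreProperties.txt).
bool toy_xid_start(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
}

bool toy_xid_continue(char32_t c) {
    return toy_xid_start(c) || c == 0x0300 || c == 0x0327;
}

// Splits code points into toy preprocessing tokens; spaces separate tokens.
// munch_continue == false: a code point that cannot start an identifier
//   becomes a token by itself (the "none of the above" rule).
// munch_continue == true: a run of XID_Continue code points that does not
//   begin with XID_Start is kept together as one token (the hypothetical
//   identifier-continue category).
std::vector<std::u32string> toy_lex(const std::u32string& in,
                                    bool munch_continue) {
    std::vector<std::u32string> tokens;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == U' ') { ++i; continue; }
        const std::size_t start = i++;
        const bool is_run =
            toy_xid_start(in[start]) ||
            (munch_continue && toy_xid_continue(in[start]));
        if (is_run) {
            // Max munch: extend over every following XID_Continue code point.
            while (i < in.size() && toy_xid_continue(in[i])) ++i;
        }
        tokens.push_back(in.substr(start, i - start));
    }
    return tokens;
}

// Checks the cases from this thread against the toy model.
void toy_lex_demo() {
    const std::u32string marks = U"\u0327\u0300";
    assert(toy_lex(marks, false).size() == 2);  // one token per UCN
    assert(toy_lex(marks, true).size() == 1);   // munched together
    // Preceded by an XID_Start character, both rules yield one identifier.
    assert(toy_lex(U"Z\u0327\u0300", false).size() == 1);
    assert(toy_lex(U"Z\u0327\u0300", true).size() == 1);
}
```

In this model the only difference between the two rules is whether the
right-hand operand of ## in the accent macro is one token or two; a
sequence that begins with an XID_Start character lexes as a single
identifier either way.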

>
>>>> And since UAX#31 actually specifies XID_Continue and
>>>> ID_Continue, not UAX#44, we need to make our reference
>>>> to UAX#31 normative and the reference to UAX#44
>>>> informative (bibliography).
>>>>
>>>> Also, the normative text should refer to UAX#31, not UAX#44.
>> I don't think that is correct. I believe UAX#44 does define the XID_Start and XID_Continue properties; UAX#31 provides some informational context for why they are defined as they are.
> I thought so too, but then I started reading in detail.
>
> https://www.unicode.org/reports/tr31/tr31-33.html#Default_Identifier_Syntax
>
> has a table 2 which defines ID_Start and ID_Continue,
> as well as
> XID_Start: XID_Start characters are derived from ID_Start as per Section 5.1, NFKC Modifications.
> XID_Continue: XID_Continue characters are derived from ID_Continue as per Section 5.1, NFKC Modifications.
>
> In contrast, UAX#44 says
>
> https://www.unicode.org/reports/tr44/tr44-26.html#DerivedCoreProperties.txt
> "Used to determine programming identifiers, as described in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31]."
>
> which is a bit short on definition.
>
> The actual list of character ranges is in
> https://www.unicode.org/Public/13.0.0/ucd/DerivedCoreProperties.txt
> so maybe we should refer to that file instead.

That sounds right to me.

Tom.

>
> Jens
>
>
>
>> Tom.
>>
>>>> Jens
>>>>
>>>>
>>>>
>>>>> and we get undefined behavior per [cpp.concat] p3.
>>>>>
>>>>> We decided not to address the undefined behavior case here,
>>>>> because that's SG12 territory.
>>>>>
>>>>> Jens
>>>>>
>>>>>
>>>>>> However, the proposed wording would reject the following case involving multiple combining characters:
>>>>>>
>>>>>>> #define accent(x) x##\u0300\u0327
>>>>>>>
>>>>>>> constexpr int accent(A) = 2;
>>>>>>> constexpr int gv2 = A\u0300\u0327;
>>>>>>> static_assert(gv2 == 2, "whatever");
>>>>>> The rejection occurs because the proposed wording <http://wiki.edg.com/pub/Wg21summer2020/SG16/uax31.html> results in each /universal-character-name/ that is not lexed as part of one of the existing /preprocessing-token/ cases being lexed as its own preprocessing token; the attempted concatenation produces two preprocessor tokens (A\u0300 and \u0327). I don't know of a principled reason for such rejection, though it isn't clear to me what characters should be permitted to be munched together. One approach would be to introduce another new /preprocessing-token/ category to match the proposed /identifier-continue/; max munch would still always prefer /identifier/ when such a sequence is preceded by a character in XID_Start. We would still want to retain the proposed new "each /universal-character-name/ ..." category as a way to avoid tearing of /universal-character-name/s that name a character not in XID_Start or XID_Continue.
>>>>>>
>>>>>> I'm not convinced that this scenario is worth addressing. It strikes me as approximately as valuable as the first example.
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>>

Received on 2020-05-05 16:49:00