sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 6 May 2020 00:18:37 +0200

On 05/05/2020 23.45, Tom Honermann wrote:
> I phrased that question poorly. I meant to ask if we *want* that example to be well-formed (after correcting it for Hubert's observation that the constructed identifier is not in NFC as I initially claimed).
>
> Let's adjust it to:
>
>> #define accent(x)x##\u0327\u0300
>> constexpr int accent(Z) = 2;
>> constexpr int gv2 = Z\u0327\u0300;
>> static_assert(gv2 == 2, "whatever");
>
> According to https://minaret.info/test/normalize.msp (thanks for that link, Hubert), the constructed identifier (again, assuming that \u0327\u0300 were to be lexed as a single preprocessor token), is in NFC.
>
> I'm happy with the degenerate "universal-character-name that is none of the above" approach, I'm just wondering if it can/should be extended to munch multiple such characters. If it can be so extended, do we have a good rationale for why we wouldn't have it do so? If it can't be so extended, what is the technical reason?

First, I see no reason why anyone would want to write the above.
An identifier containing combining marks that is, in fact, in NFC
(i.e. the marks don't actually compose with the preceding character)
seems contrived.

Second, it seems to depend very much on the specific combining
character and the character preceding it whether you get a
combination that is NFC (i.e. well-formed) or not. Your
example above would probably be ill-formed for accent(A), but
well-formed for accent(Z). That seems a rather random outcome.
(My opinion would change if we allowed non-NFC identifiers
throughout. But we don't, for good reason.)
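
The dependence on the base character can be checked mechanically. The
following is a minimal sketch in Python, assuming that the standard
unicodedata module is an acceptable stand-in for whatever NFC check an
implementation would perform; the helper name is_nfc is illustrative and
does not come from P1949.

```python
import unicodedata

def is_nfc(s):
    """Return True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

# COMBINING CEDILLA followed by COMBINING GRAVE ACCENT
marks = "\u0327\u0300"

# Z has no precomposed form with either mark, so NFC leaves it unchanged.
print(is_nfc("Z" + marks))

# A composes with the grave accent into U+00C0, so the spelled-out
# sequence is not in NFC.
print(is_nfc("A" + marks))

# Show the code points NFC would produce for the A case.
print([hex(ord(c)) for c in unicodedata.normalize("NFC", "A" + marks)])
```

The cedilla does not block the composition of A with the grave accent,
because its canonical combining class is lower than that of the grave
accent.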

Third, Hubert's observation was that there might be inadvertent
combinations of a combining character with something that precedes
it. Your editor might display the combination, but C++ will lex
the "something that precedes it" separately from the combining
character. That seems unfriendly and should be made ill-formed
as much as we can.

Fourth, given the NFC requirement, it seems to me that combining
marks should never appear in source code outside of string literals
at all. If you want them in your strings, go put them inside
string literals but don't disturb ## with them.

Finally, it seems you could do what you wanted using something
like:

#define accent(x) x ## \u0327 ## \u0300
constexpr int accent(Z) = 2;

This produces the intermediate token Z\u0327 and the final token
Z\u0327\u0300. I guess both are in NFC, so both are fine.

This seems a reasonable work-around for someone dying to do this.
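
Whether each pasting step yields a token in NFC can be verified with a
small model. This is only a sketch in Python: it assumes the ## operators
are evaluated left to right (the standard leaves the order of evaluation
unspecified), and the names paste_steps and is_nfc are hypothetical
helpers, not part of any proposal.

```python
import unicodedata

def is_nfc(s):
    """Return True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

def paste_steps(base, suffixes):
    """Model left-to-right ## pasting; return every intermediate token."""
    results = []
    token = base
    for suffix in suffixes:
        token += suffix          # one ## concatenation
        results.append(token)
    return results

# x ## \u0327 ## \u0300 with x = Z, assuming left-to-right evaluation
for token in paste_steps("Z", ["\u0327", "\u0300"]):
    escaped = token.encode("unicode_escape").decode("ascii")
    print(escaped, is_nfc(token))
```

Replacing "Z" with "A" in the call makes the last step fail the check,
which matches the base-character dependence discussed earlier.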

> Rather than just extending "universal-character-name that is none of the above" to munch multiple characters, another approach would be to keep that (so as to avoid tearing of UCNs) and to add an additional non-terminal that matches /identifier-continue/ (but avoids ambiguity with /identifier/). I believe this would suffice to enable construction via concat of every valid identifier at arbitrary UCN boundaries. Whether that is a useful design consideration I withhold opinion on.

I'm not convinced such an approach is worth the effort.

Note it took us quite a while to arrive at the current status
in SG16.

Jens

Received on 2020-05-05 17:21:42