sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 5 May 2020 18:23:47 -0400

On Tue, May 5, 2020 at 6:18 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 05/05/2020 23.45, Tom Honermann wrote:
> > I phrased that question poorly. I meant to ask if we *want* that
> > example to be well-formed (after correcting it for Hubert's observation
> > that the constructed identifier is not in NFC as I initially claimed).
> >
> > Let's adjust it to:
> >
> >> #define accent(x)x##\u0327\u0300
> >> constexpr int accent(Z) = 2;
> >> constexpr int gv2 = Z\u0327\u0300;
> >> static_assert(gv2 == 2, "whatever");
> >
> > According to https://minaret.info/test/normalize.msp (thanks for that
> > link, Hubert), the constructed identifier (again, assuming that
> > \u0327\u0300 were to be lexed as a single preprocessor token) is in NFC.
> >
> > I'm happy with the degenerate "universal-character-name that is none of
> > the above" approach, I'm just wondering if it can/should be extended to
> > munch multiple such characters. If it can be so extended, do we have a
> > good rationale for why we wouldn't have it do so? If it can't be so
> > extended, what is the technical reason?
>
> First, there is no reason why you want to write the above stuff.
> An identifier containing combining marks that is, in fact, NFC
> (i.e. the marks don't actually combine with the preceding character)
> seems contrived.
>
> Second, it seems to depend very much on the specific combining
> character and the character preceding it whether you get a
> combination that is NFC (i.e. well-formed) or not. Your
> example above would probably be ill-formed for accent(A), but
> well-formed for accent(Z). That seems a rather random outcome.
> (My opinion would change if we allowed non-NFC identifiers
> throughout. But we don't, for good reason.)
>
> Third, Hubert's observation was that there might be inadvertent
> combinations of a combining character with something that precedes
> it. Your editor might display the combination, but C++ will lex
> the "something that precedes it" separately from the combining
> character. That seems unfriendly and should be made ill-formed
> as much as we can.
>
> Fourth, given the NFC requirement, it seems to me that combining
> marks should never appear in source code outside of string literals
> at all. If you want them in your strings, go put them inside
> string literals but don't disturb ## with them.
>
> Finally, it seems you could do what you wanted using something
> like:
>
> #define accent(x) x ## \u0327 ## \u0300
>
The wording we had made the program ill-formed right after lexing the stray
combining characters (and I think that's the right thing to do).

> constexpr int accent(Z) = 2;
>
> This produces the intermediate token Z\u0327 and the final token
> Z\u0327\u0300. I guess both are NFC, so both are fine.
>
> This seems a reasonable work-around for someone dying to do this.
>
> > Rather than just extending "universal-character-name that is none of the
> > above" to munch multiple characters, another approach would be to keep that
> > (so as to avoid tearing of UCNs) and to add an additional non-terminal that
> > matches /identifier-continue/ (but avoids ambiguity with /identifier/). I
> > believe this would suffice to enable construction via concat of every valid
> > identifier at arbitrary UCN boundaries. Whether that is a useful design
> > consideration I withhold opinion on.
>
> I'm not convinced such an approach is worth the effort.
>
> Note it took us quite a while to arrive at the current status
> in SG16.
>
> Jens

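For reference, the shape of the chained-## workaround quoted above can be
sketched with ordinary identifier characters. This is only an illustration
under an assumption: the ASCII letters C and G are hypothetical stand-ins for
\u0327 and \u0300, so the sketch shows how the token pasting proceeds and says
nothing about how stray combining-character UCNs are lexed or whether the
result is in NFC.

```cpp
// Hypothetical ASCII stand-ins: C plays the role of \u0327 and G the role
// of \u0300. Each ## joins exactly two preprocessing tokens.
#define accent(x) x ## C ## G

// After macro replacement this declares: constexpr int ZCG = 2;
constexpr int accent(Z) = 2;

// Spelling the pasted identifier directly names the same variable.
constexpr int gv2 = ZCG;
static_assert(gv2 == 2, "whatever");
```

With the real UCNs in place of C and G, whether this is accepted depends on
the wording for stray combining characters discussed in this thread.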
Received on 2020-05-05 17:27:05