sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 6 May 2020 01:49:45 -0400

On 5/5/20 7:12 PM, Steve Downey via SG16 wrote:
> Precomposed characters are the norm only in Western alphabets, where
> the combinations are few, and there are historical mappings for round
> trip purposes. In other languages combining characters are the norm,
> and there do not exist precombined characters at least for the most part.
>
>
> Devanagari, for example, has combining marks as the norm. That we can
> have a z with umlaut, which doesn't exist in natural language, falls
> out from the general mechanism, but we don't want to preclude the
> general mechanism.
>
> However, accomplishing that by token pasting is a non-goal.
>
> My silly example
> std::string operator _̈ (const char*, std::size_t);
> // convert text to heavy metal form.
>
> Has real equivalents in real languages by the rules of Unicode, so it
> should be ok from the lexer's point of view.
>
> If you can't add an umlaut to _ via macro, I am not sad.

Nor am I, so long as attempts to do so are ill-formed, which is what I
understand (better now) that Hubert has been pushing for.

Tom.
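
For readers following the NFC argument quoted below (accent(A) versus accent(Z)), here is a toy sketch of the check involved. It is not a real normalizer: the names ccc, composes and is_nfc_toy are hypothetical, the tables cover only a handful of characters relevant to this thread, and every other code point is assumed to be a starter. A real implementation would use the full Unicode Character Database, for example through a library such as ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Canonical combining class for the few marks used in this thread.
// Everything else is treated as a starter (class 0); that is a
// simplifying assumption of this sketch.
int ccc(char32_t c) {
    switch (c) {
        case 0x0327: return 202;  // COMBINING CEDILLA
        case 0x0300: return 230;  // COMBINING GRAVE ACCENT
        case 0x0308: return 230;  // COMBINING DIAERESIS
        default:     return 0;
    }
}

// A tiny excerpt of the canonical composition pairs, not the full table.
bool composes(char32_t starter, char32_t mark) {
    return (starter == U'A' && mark == 0x0300)   // composes to U+00C0
        || (starter == U'C' && mark == 0x0327)   // composes to U+00C7
        || (starter == U'u' && mark == 0x0308);  // composes to U+00FC
}

// Toy check: true if the marks are in canonical order and no unblocked
// mark has a precomposed form with the starter that precedes it.
bool is_nfc_toy(const std::u32string& s) {
    const std::size_t npos = std::u32string::npos;
    std::size_t starter = npos;  // index of the most recent starter
    int max_ccc = 0;             // highest class seen since that starter
    for (std::size_t i = 0; i < s.size(); ++i) {
        const int cls = ccc(s[i]);
        if (cls == 0) {          // a new starter resets the state
            starter = i;
            max_ccc = 0;
            continue;
        }
        if (cls < max_ccc) return false;      // marks out of canonical order
        const bool blocked = max_ccc >= cls;  // an earlier mark gets in the way
        if (starter != npos && !blocked && composes(s[starter], s[i]))
            return false;                     // a composed form exists
        max_ccc = cls;
    }
    return true;
}
```

Under these assumptions, is_nfc_toy(U"Z\u0327\u0300") is true because no precomposed form exists for either mark with Z, while is_nfc_toy(U"A\u0327\u0300") is false because the grave accent is not blocked by the cedilla and A plus U+0300 composes to U+00C0.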

>
> On Tue, May 5, 2020, 18:18 Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> On 05/05/2020 23.45, Tom Honermann wrote:
> > I phrased that question poorly. I meant to ask if we *want*
> > that example to be well-formed (after correcting it for Hubert's
> > observation that the constructed identifier is not in NFC as I
> > initially claimed).
> >
> > Let's adjust it to:
> >
> >> #define accent(x) x##\u0327\u0300
> >> constexpr int accent(Z) = 2;
> >> constexpr int gv2 = Z\u0327\u0300;
> >> static_assert(gv2 == 2, "whatever");
> >
> > According to https://minaret.info/test/normalize.msp (thanks for
> > that link, Hubert), the constructed identifier (again, assuming
> > that \u0327\u0300 were to be lexed as a single preprocessor
> > token), is in NFC.
> >
> > I'm happy with the degenerate "universal-character-name that is
> > none of the above" approach, I'm just wondering if it can/should
> > be extended to munch multiple such characters. If it can be so
> > extended, do we have a good rationale for why we wouldn't have it
> > do so? If it can't be so extended, what is the technical reason?
>
> First, there is no reason why you want to write the above stuff.
> An identifier containing combining marks that is, in fact, NFC
> (i.e. the marks don't actually combine with the preceding character)
> seems contrived.
>
> Second, it seems to depend very much on the specific combining
> character and the character preceding it whether you get a
> combination that is NFC (i.e. well-formed) or not. Your
> example above would probably be ill-formed for accent(A), but
> well-formed for accent(Z). That seems a rather random outcome.
> (My opinion would change if we would allow non-NFC identifiers
> throughout. But we don't, for good reason.)
>
> Third, Hubert's observation was that there might be inadvertent
> combinations of a combining character with something that precedes
> it. Your editor might display the combination, but C++ will lex
> the "something that precedes it" separately from the combining
> character. That seems unfriendly and should be made ill-formed
> as much as we can.
>
> Fourth, given the NFC requirement, it seems to me that combining
> marks should never appear in source code outside of string literals
> at all. If you want them in your strings, go put them inside
> string literals but don't disturb ## with them.
>
> Finally, it seems you could do what you wanted using something
> like:
>
> #define accent(x) x ## \u0327 ## \u0300
> constexpr int accent(Z) = 2;
>
> This produces the intermediate token Z\u0327 and the final token
> Z\u0327\u0300 . I guess both are NFC, so are fine.
>
> This seems a reasonable work-around for someone dying to do this.
>
> > Rather than just extending "universal-character-name that is
> > none of the above" to munch multiple characters, another approach
> > would be to keep that (so as to avoid tearing of UCNs) and to add
> > an additional non-terminal that matches /identifier-continue/ (but
> > avoids ambiguity with /identifier/). I believe this would suffice
> > to enable construction via concat of every valid identifier at
> > arbitrary UCN boundaries. Whether that is a useful design
> > consideration I withhold opinion on.
>
> I'm not convinced such an approach is worth the effort.
>
> Note it took us quite a while to arrive at the current status
> in SG16.
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2020-05-06 00:52:47