sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 5 May 2020 19:12:09 -0400

Precomposed characters are the norm only in Western alphabets, where the
combinations are few and historical mappings exist for round-trip
purposes. In other scripts, combining characters are the norm, and for
the most part no precomposed characters exist.

Devanagari, for example, has combining marks as the norm. That we can have
a z with umlaut, which doesn't exist in natural language, falls out from
the general mechanism, but we don't want to preclude the general mechanism.
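
To make the precomposed-versus-combining distinction concrete, here is a
small sketch that counts code points in UTF-32 literals. The helper name
code_points is made up for illustration, and the sketch assumes the
compiler stores string literals as written, without normalizing them.

```cpp
#include <cstddef>

// Hypothetical helper: number of code points in a UTF-32 string literal,
// not counting the terminating null.
template <std::size_t N>
constexpr std::size_t code_points(const char32_t (&)[N]) { return N - 1; }

// Western case: e with acute exists precomposed (U+00E9) and can also be
// spelled as e followed by COMBINING ACUTE ACCENT (U+0301).
static_assert(code_points(U"\u00E9") == 1, "precomposed: one code point");
static_assert(code_points(U"e\u0301") == 2, "decomposed: base plus mark");

// Devanagari KA (U+0915) followed by VOWEL SIGN I (U+093F): there is no
// precomposed form, so the combining sequence is the only spelling.
static_assert(code_points(U"\u0915\u093F") == 2, "base plus vowel sign");

// z followed by COMBINING DIAERESIS (U+0308): no precomposed form either;
// it simply falls out of the general mechanism.
static_assert(code_points(U"z\u0308") == 2, "base plus mark");
```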

However, accomplishing that by token pasting is a non-goal.

My silly example:
std::string operator""_̈(const char*, std::size_t);
// convert text to heavy metal form.

It has real equivalents in real languages by the rules of Unicode, so it
should be OK from the lexer's point of view.

If you can't add an umlaut to _ via macro, I am not sad.
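
For comparison, this is roughly what staged token pasting looks like when
only ordinary identifier characters are involved. The macro name DECORATE
and the ASCII suffixes c and g are placeholders of mine, standing in for
the two combining marks; whether the same thing is well-formed with
combining marks is exactly what the thread below is about. The order in
which the two ## operators are evaluated is unspecified, but here either
order yields the same final identifier.

```cpp
// Hypothetical macro: paste two suffixes onto x in one replacement list.
#define DECORATE(x) x ## c ## g

constexpr int DECORATE(Z) = 2;   // declares the identifier Zcg
constexpr int gv2 = Zcg;         // refers to it by its pasted spelling

static_assert(gv2 == 2, "whatever");
```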

On Tue, May 5, 2020, 18:18 Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 05/05/2020 23.45, Tom Honermann wrote:
> > I phrased that question poorly. I meant to ask if we *want* that
> > example to be well-formed (after correcting it for Hubert's observation
> > that the constructed identifier is not in NFC as I initially claimed).
> >
> > Let's adjust it to:
> >
> >> #define accent(x) x##\u0327\u0300
> >> constexpr int accent(Z) = 2;
> >> constexpr int gv2 = Z\u0327\u0300;
> >> static_assert(gv2 == 2, "whatever");
> >
> > According to https://minaret.info/test/normalize.msp (thanks for that
> > link, Hubert), the constructed identifier (again, assuming that
> > \u0327\u0300 were to be lexed as a single preprocessor token), is in NFC.
> >
> > I'm happy with the degenerate "universal-character-name that is none of
> > the above" approach, I'm just wondering if it can/should be extended to
> > munch multiple such characters. If it can be so extended, do we have a
> > good rationale for why we wouldn't have it do so? If it can't be so
> > extended, what is the technical reason?
>
> First, there is no reason why you want to write the above stuff.
> An identifier containing combining marks that is, in fact, NFC
> (i.e. the marks don't actually combine with the preceding character)
> seems contrived.
>
> Second, it seems to depend very much on the specific combining
> character and the character preceding it whether you get a
> combination that is NFC (i.e. well-formed) or not. Your
> example above would probably be ill-formed for accent(A), but
> well-formed for accent(Z). That seems a rather random outcome.
> (My opinion would change if we allowed non-NFC identifiers
> throughout. But we don't, for good reason.)
>
> Third, Hubert's observation was that there might be inadvertent
> combinations of a combining character with something that precedes
> it. Your editor might display the combination, but C++ will lex
> the "something that precedes it" separately from the combining
> character. That seems unfriendly and should be made ill-formed
> as much as we can.
>
> Fourth, given the NFC requirement, it seems to me that combining
> marks should never appear in source code outside of string literals
> at all. If you want them in your strings, go put them inside
> string literals but don't disturb ## with them.
>
> Finally, it seems you could do what you wanted using something
> like:
>
> #define accent(x) x ## \u0327 ## \u0300
> constexpr int accent(Z) = 2;
>
> This produces the intermediate token Z\u0327 and the final token
> Z\u0327\u0300 . I guess both are NFC, so are fine.
>
> This seems a reasonable work-around for someone dying to do this.
>
> > Rather than just extending "universal-character-name that is none of the
> > above" to munch multiple characters, another approach would be to keep that
> > (so as to avoid tearing of UCNs) and to add an additional non-terminal that
> > matches /identifier-continue/ (but avoids ambiguity with /identifier/). I
> > believe this would suffice to enable construction via concat of every valid
> > identifier at arbitrary UCN boundaries. Whether that is a useful design
> > consideration I withhold opinion on.
>
> I'm not convinced such an approach is worth the effort.
>
> Note it took us quite a while to arrive at the current status
> in SG16.
>
> Jens

Received on 2020-05-05 18:15:24