sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 6 May 2020 09:09:44 +0200

On Wed, 6 May 2020 at 07:49, Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> On 5/5/20 7:12 PM, Steve Downey via SG16 wrote:
>
> Precomposed characters are the norm only in Western alphabets, where the
> combinations are few, and there are historical mappings for round trip
> purposes. In other languages combining characters are the norm, and
> precomposed characters, for the most part, do not exist.
>
>
> Devanagari, for example, has combining marks as the norm. That we can have
> a z with umlaut, which doesn't exist in natural language, falls out from
> the general mechanism, but we don't want to preclude the general mechanism.
>
> However, accomplishing that by token pasting is a non-goal.
>
> My silly example
> std::string operator _̈ (const char*, std::size_t);
> // convert text to heavy metal form.
>
> It has real equivalents in real languages by the rules of Unicode, so it
> should be OK from the lexer's point of view.
>
> If you can't add an umlaut to _ via macro, I am not sad.
>
> Nor am I, so long as attempts to do so are ill-formed, which is what I
> understand (better now) that Hubert has been pushing for.
>
> Tom.
>
>
> On Tue, May 5, 2020, 18:18 Jens Maurer via SG16 <sg16_at_[hidden]>
> wrote:
>
>> On 05/05/2020 23.45, Tom Honermann wrote:
>> > I phrased that question poorly. I meant to ask if we *want* that
>> example to be well-formed (after correcting it for Hubert's observation
>> that the constructed identifier is not in NFC as I initially claimed).
>> >
>> > Let's adjust it to:
>> >
>> >> #define accent(x) x##\u0327\u0300
>> >> constexpr int accent(Z) = 2;
>> >> constexpr int gv2 = Z\u0327\u0300;
>> >> static_assert(gv2 == 2, "whatever");
>> >
>> > According to https://minaret.info/test/normalize.msp (thanks for that
>> link, Hubert), the constructed identifier (again, assuming that
>> \u0327\u0300 were to be lexed as a single preprocessor token), is in NFC.
>> >
>> > I'm happy with the degenerate "universal-character-name that is none of
>> the above" approach, I'm just wondering if it can/should be extended to
>> munch multiple such characters. If it can be so extended, do we have a
>> good rationale for why we wouldn't have it do so? If it can't be so
>> extended, what is the technical reason?
>>
>> First, there is no reason why you want to write the above stuff.
>> An identifier containing combining marks that is, in fact, NFC
>> (i.e. the marks don't actually combine with the preceding character)
>> seems contrived.
>>
>> Second, it seems to depend very much on the specific combining
>> character and the character preceding it whether you get a
>> combination that is NFC (i.e. well-formed) or not. Your
>> example above would probably be ill-formed for accent(A), but
>> well-formed for accent(Z). That seems a rather random outcome.
>> (My opinion would change if we would allow non-NFC identifiers
>> throughout. But we don't, for good reason.)
>>
>> Third, Hubert's observation was that there might be inadvertent
>> combinations of a combining character with something that precedes
>> it. Your editor might display the combination, but C++ will lex
>> the "something that precedes it" separately from the combining
>> character. That seems unfriendly and should be made ill-formed
>> as much as we can.
>>
>> Fourth, given the NFC requirement, it seems to me that combining
>> marks should never appear in source code outside of string literals
>> at all. If you want them in your strings, go put them inside
>> string literals but don't disturb ## with them.
>>
>
More generally, concatenating two NFC sequences is not guaranteed to result
in an NFC sequence. [1]
Maybe NFC verification should be done on C++ tokens rather than on
preprocessor tokens (otherwise we would have to check twice)?
But I question whether spending so much time on these contrived examples is
a valuable use of anyone's time.

As such, making
#define accent(x) x##\uxxxx

ill-formed is a course of action that I think should be entertained.

(AFAICT, concatenating two valid identifiers results in a valid identifier
in all cases.)
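
To make the concatenation point concrete, here is a toy sketch in C++. It is
not a real normalizer: toy_ccc, kToyComposites and toy_is_nfc are
hypothetical names, and the tables hold only the two combining marks from
this thread plus a few hand-picked precomposed letters. Everything not listed
is assumed to be a starter with no composite. A real check would use the full
Unicode data (for example via a library such as ICU).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy data (assumption): combining classes for the two marks in this thread.
// Every other code point is treated as a starter (class 0).
inline int toy_ccc(char32_t c) {
    switch (c) {
        case 0x0327: return 202;  // COMBINING CEDILLA
        case 0x0300: return 230;  // COMBINING GRAVE ACCENT
        default:     return 0;
    }
}

// Toy data (assumption): a few starter + mark pairs that have a precomposed form.
struct ToyComposite { char32_t starter; char32_t mark; };
constexpr ToyComposite kToyComposites[] = {
    {U'A', 0x0300},  // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
    {U'a', 0x0300},  // U+00E0 LATIN SMALL LETTER A WITH GRAVE
    {U'C', 0x0327},  // U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
};

inline bool toy_has_composite(char32_t starter, char32_t mark) {
    for (const ToyComposite& c : kToyComposites) {
        if (c.starter == starter && c.mark == mark) return true;
    }
    return false;
}

// Simplified NFC test: reject marks that are out of canonical order, and
// marks that could still compose with the nearest preceding starter.
inline bool toy_is_nfc(const std::u32string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const int cls = toy_ccc(s[i]);
        if (cls == 0) continue;  // starters are accepted as-is in this toy
        // Canonical ordering: a mark must not follow a mark of higher class.
        if (i > 0 && toy_ccc(s[i - 1]) > cls) return false;
        // Walk back to the last starter; an intervening mark of the same or
        // higher class blocks composition with that starter.
        std::size_t j = i;
        bool blocked = false;
        while (j > 0 && toy_ccc(s[j - 1]) != 0) {
            if (toy_ccc(s[j - 1]) >= cls) blocked = true;
            --j;
        }
        if (j > 0 && !blocked && toy_has_composite(s[j - 1], s[i])) return false;
    }
    return true;
}
```

Under these toy tables, U"A" and U"\u0300" each pass toy_is_nfc, but their
concatenation U"A\u0300" does not, because it would compose to U+00C0. The
same check accepts U"Z\u0327\u0300" and rejects U"A\u0327\u0300", which
matches the accent(Z) versus accent(A) asymmetry Jens describes.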

[1] http://unicode.org/reports/tr15/#Concatenation

Corentin

>
>> Finally, it seems you could do what you wanted using something
>> like:
>>
>> #define accent(x) x ## \u0327 ## \u0300
>> constexpr int accent(Z) = 2;
>>
>> This produces the intermediate token Z\u0327 and the final token
>> Z\u0327\u0300 . I guess both are NFC, so are fine.
>>
>> This seems a reasonable work-around for someone dying to do this.
>>
>> > Rather than just extending "universal-character-name that is none of
>> the above" to munch multiple characters, another approach would be to keep
>> that (so as to avoid tearing of UCNs) and to add an additional non-terminal
>> that matches /identifier-continue/ (but avoids ambiguity with
>> /identifier/). I believe this would suffice to enable construction via
>> concat of every valid identifier at arbitrary UCN boundaries. Whether that
>> is a useful design consideration I withhold opinion on.
>>
>> I'm not convinced such an approach is worth the effort.
>>
>> Note it took us quite a while to arrive at the current status
>> in SG16.
>>
>> Jens
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2020-05-06 02:13:51