sg16: Re: [SG16] Multiple combining characters and P1949R3: C++ Identifier Syntax using Unicode Standard Annex 31

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 6 May 2020 01:45:12 -0400

On 5/5/20 6:23 PM, Hubert Tong via SG16 wrote:
> On Tue, May 5, 2020 at 6:18 PM Jens Maurer via SG16
> <sg16_at_[hidden]> wrote:
>
> On 05/05/2020 23.45, Tom Honermann wrote:
> > I phrased that question poorly. I meant to ask if we *want*
> > that example to be well-formed (after correcting it for Hubert's
> > observation that the constructed identifier is not in NFC as I
> > initially claimed).
> >
> > Let's adjust it to:
> >
> >> #define accent(x) x##\u0327\u0300
> >> constexpr int accent(Z) = 2;
> >> constexpr int gv2 = Z\u0327\u0300;
> >> static_assert(gv2 == 2, "whatever");
> >
> > According to https://minaret.info/test/normalize.msp (thanks for
> > that link, Hubert), the constructed identifier (again, assuming
> > that \u0327\u0300 were to be lexed as a single preprocessor
> > token) is in NFC.
> >
> > I'm happy with the degenerate "universal-character-name that is
> > none of the above" approach; I'm just wondering if it can/should
> > be extended to munch multiple such characters. If it can be so
> > extended, do we have a good rationale for why we wouldn't have it
> > do so? If it can't be so extended, what is the technical reason?
>
> First, there is no reason why you want to write the above stuff.
> An identifier containing combining marks that is, in fact, NFC
> (i.e. the marks don't actually combine with the preceding character)
> seems contrived.
>
It is certainly contrived for the purposes of this discussion. I don't
know if there are worthwhile use cases; I know I'm not in a position to
claim that there are not.
>
>
> Second, it seems to depend very much on the specific combining
> character and the character preceding it whether you get a
> combination that is NFC (i.e. well-formed) or not. Your
> example above would probably be ill-formed for accent(A), but
> well-formed for accent(Z). That seems a rather random outcome.
> (My opinion would change if we would allow non-NFC identifiers
> throughout. But we don't, for good reason.)
>
That is a good point; agreed.
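
The dependence on the base character can be checked mechanically. The sketch below uses Python only because its standard `unicodedata` module exposes normalization; the helper name `is_nfc` is hypothetical and not taken from the paper. It tests whether the spelling that each expansion of accent() would produce is already in NFC.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s

# The suffix pasted by the macro: COMBINING CEDILLA, COMBINING GRAVE ACCENT.
SUFFIX = "\u0327\u0300"

for base in ("Z", "A"):
    spelled = base + SUFFIX
    composed = unicodedata.normalize("NFC", spelled)
    # Show the base, whether the spelling is NFC, and the NFC code points.
    print(base, is_nfc(spelled), [f"U+{ord(c):04X}" for c in composed])
```

For Z no precomposed character exists, so the sequence is left alone. For A the grave accent composes with the base to U+00C0 and the cedilla trails it, so the spelling as written is not NFC.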
>
>
> Third, Hubert's observation was that there might be inadvertent
> combinations of a combining character with something that precedes
> it. Your editor might display the combination, but C++ will lex
> the "something that precedes it" separately from the combining
> character. That seems unfriendly and should be made ill-formed
> as much as we can.
>
Another good point; agreed.
>
>
> Fourth, given the NFC requirement, it seems to me that combining
> marks should never appear in source code outside of string literals
> at all. If you want them in your strings, go put them inside
> string literals but don't disturb ## with them.
>
This I don't agree with. UAX#31 and the characters in the XID_Start and
XID_Continue classes permit combining characters in identifiers in NFC.
NFC does not eliminate the need for combining characters in many
scripts; precomposed characters are not defined for many legitimate
characters.
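
A small illustration of that point, sketched in Python as a convenient way to query the Unicode data. The sample string is an arbitrary pick and not from the paper, and `str.isidentifier()` stands in here for an XID_Start/XID_Continue check. The letter q followed by U+0303 COMBINING TILDE has no precomposed form, so NFC keeps the combining mark as a separate code point, and the result is still shaped like a valid identifier.

```python
import unicodedata

word = "q\u0303"  # LATIN SMALL LETTER Q + COMBINING TILDE (illustrative choice)

nfc = unicodedata.normalize("NFC", word)

# NFC cannot compose this pair, so the string is unchanged.
print(nfc == word)
# The second code point is still a combining mark (nonzero combining class).
print(unicodedata.combining(nfc[1]) != 0)
# The sequence is acceptable as an identifier; the lone mark is not,
# because a combining mark may continue an identifier but not start one.
print(nfc.isidentifier(), "\u0303".isidentifier())
```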

I agree with not disturbing the ## operator to explicitly allow them.

>
> Finally, it seems you could do what you wanted using something
> like:
>
> #define accent(x) x ## \u0327 ## \u0300
>
If we were to permit this, I would find this a perfectly acceptable
workaround.
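
The chained paste can be modeled outside the compiler. The sketch below is Python, and `paste_chain` and `all_nfc` are hypothetical helpers. It assumes the ## operators are applied left to right, although the standard leaves their order of evaluation unspecified. It checks whether every intermediate spelling stays in NFC.

```python
import unicodedata

def paste_chain(*pieces):
    """Concatenate pieces left to right, returning each intermediate spelling."""
    steps = []
    current = pieces[0]
    for piece in pieces[1:]:
        current += piece
        steps.append(current)
    return steps

def all_nfc(spellings):
    """Return True if every spelling is unchanged by NFC normalization."""
    return all(unicodedata.normalize("NFC", s) == s for s in spellings)

z_steps = paste_chain("Z", "\u0327", "\u0300")  # models accent(Z)
a_steps = paste_chain("A", "\u0327", "\u0300")  # models accent(A)

print(all_nfc(z_steps), all_nfc(a_steps))
```

Under these assumptions both steps for Z stay in NFC, while the final step for A does not, which matches the Z-versus-A asymmetry noted earlier in the thread.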
> The wording we had made the program ill-formed right after lexing the
> stray combining characters (and I think that's the right thing to do).

Thank you, Hubert. I had come away from the telecon with the (apparent)
misconception that stray combining characters would not be diagnosed
until the end of translation phase 4 (and thus could participate in
concatenation).

I'm content with diagnosing them immediately after they are lexed. That
suffices for us to do something different later if sufficient motivation
is found.

Given my misunderstanding here, I urge careful review of my writeup of
the telecon at
https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020 to ensure
I didn't misrepresent something.

Steve, I think this is something that is worth making more explicit in
the paper. Also, I think it would be helpful to include a table in the
paper that demonstrates changes in interpretation of these code examples
before and after the proposed wording. Something like:

| Code | Before this proposal | After this proposal |

| \u0300 | two preprocessing tokens | one preprocessing token |

| #define accent(x) x##\u0300 | UB, since \u0300 is not a valid identifier and tearing of the UCN results in the concatenation producing x\ which is not a single preprocessor token | ill-formed, because \u0300 is a stray UCN |

> constexpr int accent(Z) = 2;
>
> This produces the intermediate token Z\u0327 and the final token
> Z\u0327\u0300 . I guess both are NFC, so are fine.
>
> This seems a reasonable work-around for someone dying to do this.
>
> > Rather than just extending "universal-character-name that is
> > none of the above" to munch multiple characters, another approach
> > would be to keep that (so as to avoid tearing of UCNs) and to add
> > an additional non-terminal that matches /identifier-continue/ (but
> > avoids ambiguity with /identifier/). I believe this would suffice
> > to enable construction via concat of every valid identifier at
> > arbitrary UCN boundaries. Whether that is a useful design
> > consideration I withhold opinion on.
>
> I'm not convinced such an approach is worth the effort.
>
Neither am I :)
>
>
> Note it took us quite a while to arrive at the current status
> in SG16.
>
Yes, and these questions were not intended to disrupt that progress, but
rather to ensure we had good rationale for questions that might be asked
in EWG. I'm content with these answers now.

Tom.

>
> Jens

Received on 2020-05-06 00:48:15