It is certainly contrived for the purposes of this discussion. I don't know if there are worthwhile use cases; I know I'm not in a position to claim that there are not.

On Tue, May 5, 2020 at 6:18 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
On 05/05/2020 23.45, Tom Honermann wrote:
> I phrased that question poorly. I meant to ask if we *want* that example to be well-formed (after correcting it for Hubert's observation that the constructed identifier is not in NFC as I initially claimed).
>
> Let's adjust it to:
>
>> #define accent(x)x##\u0327\u0300
>> constexpr int accent(Z) = 2;
>> constexpr int gv2 = Z\u0327\u0300;
>> static_assert(gv2 == 2, "whatever");
>
> According to https://minaret.info/test/normalize.msp (thanks for that link, Hubert), the constructed identifier (again, assuming that \u0327\u0300 were to be lexed as a single preprocessor token), is in NFC.
>
> I'm happy with the degenerate "universal-character-name that is none of the above" approach, I'm just wondering if it can/should be extended to munch multiple such characters. If it can be so extended, do we have a good rationale for why we wouldn't have it do so? If it can't be so extended, what is the technical reason?
First, there is no reason why you want to write the above stuff.
An identifier containing combining marks that is, in fact, NFC
(i.e. the marks don't actually combine with the preceding character)
seems contrived.
That is a good point; agreed.
Second, it seems to depend very much on the specific combining
character and the character preceding it whether you get a
combination that is NFC (i.e. well-formed) or not. Your
example above would probably be ill-formed for accent(A), but
well-formed for accent(Z). That seems a rather random outcome.
(My opinion would change if we would allow non-NFC identifiers
throughout. But we don't, for good reason.)
Another good point; agreed.
Third, Hubert's observation was that there might be inadvertent
combinations of a combining character with something that precedes
it. Your editor might display the combination, but C++ will lex
the "something that precedes it" separately from the combining
character. That seems unfriendly and should be made ill-formed
as much as we can.
Fourth, given the NFC requirement, it seems to me that combining
marks should never appear in source code outside of string literals
at all. If you want them in your strings, go put them inside
string literals but don't disturb ## with them.
This I don't agree with. UAX#31 and the characters in the XID_Start and XID_Continue classes permit combining characters in identifiers in NFC. NFC does not eliminate the need for combining characters in many scripts; precomposed characters are not defined for many legitimate characters.
I agree with not disturbing the ## operator to explicitly allow
them.
If we were to permit this, I would find this a perfectly acceptable workaround.
Finally, it seems you could do what you wanted using something
like:
#define accent(x) x ## \u0327 ## \u0300
The wording we had made the program ill-formed right after lexing the stray combining characters (and I think that's the right thing to do).
Thank you, Hubert. I had come away from the telecon with the
(apparent) misconception that stray combining characters would not
be diagnosed until the end of translation phase 4 (and thus could
participate in concatenation).
I'm content with diagnosing them immediately after they are lexed. That suffices for us to do something different later if sufficient motivation is found.
Given my misunderstanding here, I urge careful review of my
writeup of the telecon at https://github.com/sg16-unicode/sg16-meetings#april-22nd-2020
to ensure I didn't misrepresent something.
Steve, I think this is something that is worth making more explicit in the paper. Also, I think it would be helpful to include a table in the paper that demonstrates changes in interpretation of these code examples before and after the proposed wording. Something like:
| \u0300 | two preprocessing tokens before this proposal | one preprocessing token after |
| #define accent(x) x##\u0300 | UB before this proposal since \u0300 is not a valid identifier and tearing of the UCN results in the concatenation producing x\ which is not a single preprocessor token | ill-formed after because \u0300 is a stray UCN |
Neither am I :)
constexpr int accent(Z) = 2;
This produces the intermediate token Z\u0327 and the final token
Z\u0327\u0300 . I guess both are NFC, so are fine.
This seems a reasonable work-around for someone dying to do this.
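For readers who want to see the chained ## mechanics in isolation, here is a minimal sketch using ASCII stand-ins. The names accent_ascii, a1 and b2 are hypothetical placeholders (a1 for \u0327, b2 for \u0300) and do not appear in the thread. Whether the UCN version itself is accepted depends on the wording discussed here, so this only illustrates how the intermediate and final tokens are formed.

```cpp
// ASCII analogue of the chained-## workaround (assumption: a1 and b2
// are hypothetical stand-ins for the combining marks \u0327 and \u0300).
// Each ## pastes two adjacent preprocessing tokens into one; the
// result of every paste must itself be a single valid token.
#define accent_ascii(x) x ## a1 ## b2

// Expands to: constexpr int Za1b2 = 2;
// (intermediate token Za1 or a1b2, depending on the unspecified
// order in which the two ## operators are evaluated)
constexpr int accent_ascii(Z) = 2;

// Spelling the pasted identifier directly names the same variable.
constexpr int gv2 = Za1b2;
static_assert(gv2 == 2, "whatever");
```

With real combining marks the same shape applies, but each pasted piece would be a lone UCN token, which is exactly the case the wording has to decide on.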
> Rather than just extending "universal-character-name that is none of the above" to munch multiple characters, another approach would be to keep that (so as to avoid tearing of UCNs) and to add an additional non-terminal that matches /identifier-continue/ (but avoids ambiguity with /identifier/). I believe this would suffice to enable construction via concat of every valid identifier at arbitrary UCN boundaries. Whether that is a useful design consideration I withhold opinion on.
I'm not convinced such an approach is worth the effort.
Note it took us quite a while to arrive at the current status
in SG16.
Yes, and these questions were not intended to disrupt that
progress, but rather to ensure we had good rationale for questions
that might be asked in EWG. I'm content with these answers now.
Tom.
Jens