On 6/25/26 11:51, Corentin via SG16 wrote:
> We say that
>
> If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in the basic character set matches the last category, the program is ill-formed.
>
>
> But no one implements that restriction: https://compiler-explorer.com/z/fe9jWcE1G
>
> In non-pedantic mode, GCC accepts virtually anything as an identifier, and in pedantic mode it treats anything outside of the basic character set as part of an identifier, and then rejects it if it does not follow the XID requirements.
>
> Besides that oddity, no one rejects nonsensical pp-tokens.
> The restriction itself is not consistent.
>
> Either we want to allow codepoints to appear in source even if they are not part of grammar elements... or we do not.
> Whether they are part of the basic character set should have no impact.
>
>
> So either we want to say
>> *If any character matches the last category, the program is ill-formed.*
Sounds plausible to me, but maybe causes more friction for
"$identifier" and similar?
I'm also interested in how this would interact with $identifier. All I really care about is that tokenization stays fairly consistent across implementations, and making $ ill-formed per [lex.pptoken] paragraph 1 seems to move away from that.
If we make the change allowing $identifier first and require that to be a preprocessing-token consistently, then adjusting the rule like Corentin suggested seems fine.
>> *If a U+0027 apostrophe or a U+0022 quotation mark matches the last category, the program is ill-formed.*
>
> It seems to me that the use case for allowing these non-token tokens to exist is to allow concatenation shenanigans.
> But you can also do token concatenation shenanigans with characters outside the basic character set:
>
> #define __CONCAT(A,B) A ## B
> #define CONCAT(A,B) __CONCAT(A, B)
> #define ONE ፩ // not an identifier (xid_start = false, xid_continue = true)
> int CONCAT(A, ONE);
>
> Is that useful? Maybe not. Is that less useful than the shenanigans that /_are_/ allowed? Also no!
While we discussed "$identifier", you said you didn't want phase-4 identifiers
and phase-7 identifiers to be under different rules.
Only in phase 4 do we have non-identifier preprocessing tokens that look
roughly like text (these single-character things); those are all ill-formed
already when transitioning to phase 7.
One consistent view might be to allow everything as preprocessing tokens in phase 4
(maybe concatenation and stringization will make those vanish before phase 7),
and then do all the checking when transitioning to phase 7.
Delaying the validity check as much as possible also seems reasonable to me.
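
To make the "vanish before phase 7" case concrete, here is a minimal sketch. It assumes C++26, where `@` is in the basic character set and so may form a single-character preprocessing token without tripping the rule quoted above; STRINGIZE, kAt, and check_stringized are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper macro: turns the spelling of its argument into a
// string literal.
#define STRINGIZE(x) #x

// In phase 3, `@` matches the "single non-whitespace character" category.
// The # operator consumes it in phase 4, so phase 7 only sees the string
// literal "@" and never has to convert the stray token.
static const char* const kAt = STRINGIZE(@);

// Checks that the stray character survived only as string contents.
inline void check_stringized() {
    assert(std::strcmp(kAt, "@") == 0);
}
```

If all checking were deferred to the phase 4 to phase 7 transition, this program would stay well-formed, while a bare `@` left in the token stream would still be rejected at that point.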
However, I believe the rules were crafted this way to allow for extension points,
e.g. for the situation when we want "real" math symbols as operators.
On a side note, how is that supposed to work for such "extension points" when they consist of multiple code points?
For example, what if I wanted a prefix unary operator, consisting of REGIONAL INDICATOR SYMBOL LETTER D and REGIONAL INDICATOR SYMBOL LETTER E (forming a German flag), that translates the following string literal operand to German? It seems like that extension is not possible because we match "each non-whitespace character that cannot be one of the above" as one token, so the flag would be lexed as two separate preprocessing tokens. Maybe that use case is a bit contrived, but it seems plausible that for these extensions to work, you need a sequence of characters rather than a single character to form a preprocessing-token. Aren't there some mathematical glyphs that consist of more than one code point? However, that seems to open the door to implementation-specific tokenization once more.
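
As a rough illustration of why the flag cannot be a single token today, here is a toy model, not any compiler's actual lexer. It only counts code points in a UTF-8 string, on the assumption that each code point of an otherwise unrecognized sequence becomes its own preprocessing token; count_code_points and kGermanFlag are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model: counts the code points in a UTF-8 string. Under the
// "each non-whitespace character" rule, each of them would be lexed as a
// separate preprocessing token.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a
        // new code point.
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// U+1F1E9 U+1F1EA (regional indicators D and E), spelled as UTF-8 bytes.
static const std::string kGermanFlag = "\xF0\x9F\x87\xA9\xF0\x9F\x87\xAA";
```

Here `count_code_points(kGermanFlag)` yields 2: the flag renders as one glyph but consists of two code points, hence two candidate tokens under the current wording.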
Slightly unrelated thought: Maybe we want a separate "pp-non-identifier" preprocessing
token that explicitly lists the "allowed" single characters (e.g. "@"), and then
we can just bluntly say during lexing "the program is ill-formed if any character
doesn't end up as part of a preprocessing token".
+1, that seems like an editorial improvement.