Date: Thu, 25 Jun 2026 16:20:18 +0200
On 6/25/26 11:51, Corentin via SG16 wrote:
> We say that
>
> If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in the basic character set matches the last category, the program is ill-formed.
>
>
> But no one implements that restriction: https://compiler-explorer.com/z/fe9jWcE1G
>
> In non-pedantic mode, GCC accepts virtually anything as an identifier, and in pedantic mode they consider that anything outside of the basic character set is part of an identifier - and then reject them if they do not follow the XID requirements.
>
> Besides that oddity, no one rejects nonsensical pp-tokens.
> The restriction itself is not consistent.
>
> Either we want to allow codepoints to appear in source even if they are not part of grammar elements... or we do not.
> Whether they are part of the basic character set should have no impact.
>
>
> So either we want to say
>> *If any character matches the last category, the program is ill-formed.*
Sounds plausible to me, but maybe causes more friction for
"$identifier" and similar?
>> *If a U+0027 apostrophe or a U+0022 quotation mark matches the last category, the program is ill-formed.*
>
> It seems to me that the use case for allowing these non-token tokens to exist is to allow concatenation shenanigans.
> But you can do token concatenation shenanigans with characters outside the basic character set, too:
>
> #define __CONCAT(A,B) A ## B
> #define CONCAT(A,B) __CONCAT(A, B)
> #define ONE ፩ // not an identifier (xid_start = false, xid_continue = true)
> int CONCAT(A, ONE);
>
> Is that useful? Maybe not. Is that less useful than the shenanigans that /_are_/ allowed? Also no!
While we discussed "$identifier", you said you didn't want phase-4 identifiers
and phase-7 identifiers to be under different rules.

Only in phase 4 do we have non-identifier preprocessing tokens that look
roughly like text (these single-character things); those are all ill-formed
already when transitioning to phase 7.

One consistent view might be to allow everything as preprocessing tokens in phase 4
(maybe concatenation and stringization will make those vanish before phase 7),
and then do all the checking when transitioning to phase 7.

However, I believe the rules were crafted this way to allow for extension points,
e.g. for the situation when we want "real" math symbols as operators.

Slightly unrelated thought: Maybe we want a separate "pp-non-identifier" preprocessing
token that contains the "allowed" single-characters (e.g. "@") explicitly, and then
we can just bluntly say during lexing "any character that doesn't end up as part of
a preprocessing token is ill-formed".
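
To make the "pp-non-identifier" idea a bit more concrete, here is a minimal sketch of
the lexer decision it implies. This is not proposed wording and not how any compiler
does it. The names Leftover, kAllowedSingles and classify_leftover are hypothetical,
and the allow-list is an assumption: only "@" is named above; "$" and "`" are added
purely for illustration.

```cpp
#include <string_view>

// Outcome for a character that matched no other preprocessing-token category.
enum class Leftover { PpNonIdentifier, IllFormed };

// ASSUMPTION: illustrative allow-list. Only '@' is mentioned in the message;
// '$' and '`' are included here just to show the shape of the rule.
constexpr std::u32string_view kAllowedSingles = U"@$`";

// Hypothetical helper: an explicitly listed character becomes a
// pp-non-identifier token; every other leftover character is ill-formed,
// regardless of whether it is in the basic character set.
constexpr Leftover classify_leftover(char32_t c) {
    return kAllowedSingles.find(c) != std::u32string_view::npos
               ? Leftover::PpNonIdentifier
               : Leftover::IllFormed;
}

static_assert(classify_leftover(U'@') == Leftover::PpNonIdentifier);
static_assert(classify_leftover(U'"') == Leftover::IllFormed);
// U+1369 ETHIOPIC DIGIT ONE, the character used in the CONCAT example above.
static_assert(classify_leftover(U'\u1369') == Leftover::IllFormed);
```

With such a rule the basic/non-basic distinction disappears from the check; the only
question is whether the character is on the explicit list.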
Jens
Received on 2026-06-25 14:20:26
