Date: Fri, 26 Jun 2026 10:58:53 +0200
On Thu, 25 Jun 2026 at 16:20, Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:
>
>
> On 6/25/26 11:51, Corentin via SG16 wrote:
> > We say that
> >
> > If a U+0027 apostrophe, a U+0022 quotation mark, or any character
> not in the basic character set matches the last category, the program is
> ill-formed.
> >
> >
> > But no one implements that restriction:
> > https://compiler-explorer.com/z/fe9jWcE1G
> >
> > In non-pedantic mode, GCC accepts virtually anything as an identifier,
> and in pedantic mode they consider that anything outside of the basic
> character set is part of an identifier - and then reject them if they do
> not follow the XID requirements.
> >
> > Besides that oddity, no one rejects nonsensical pp-tokens.
> > The restriction itself is not consistent.
> >
> > Either we want to allow codepoints to appear in source even if they are
> not part of grammar elements... or we do not.
> > Whether they are part of the basic character set should have no impact.
> >
> >
> > So either we want to say
> >> *If any character matches the last category, the program is ill-formed.*
>
> Sounds plausible to me, but maybe causes more friction for
> "$identifier" and similar?
>
I'm also interested in how this would interact with $identifier. All I
really care about is that tokenization stays fairly consistent across
implementations, and making $ ill-formed per [lex.pptoken] paragraph 1
seems to move away from that.
If we make the change allowing $identifier first and require that to be a
preprocessing-token consistently, then adjusting the rule like Corentin
suggested seems fine.
> >> *If a U+0027 apostrophe or a U+0022 quotation mark matches the last
> > category, the program is ill-formed.*
> >
> > It seems to me that the use case for allowing these non-token tokens to
> > exist is to allow concatenation shenanigans.
> > But you can do token concatenation shenanigans with characters outside the
> > basic character set:
> >
> > #define __CONCAT(A,B) A ## B
> > #define CONCAT(A,B) __CONCAT(A, B)
> > #define ONE ፩ // not an identifier (xid_start = false, xid_continue = true)
> > int CONCAT(A, ONE);
> >
> > Is that useful? Maybe not. Is that less useful than the shenanigans that
> > *are* allowed? Also no!
>
> While we discussed "$identifier", you said you didn't want phase-4
> identifiers
> and phase-7 identifiers to be under different rules.
> Only in phase 4 do we have non-identifier preprocessing tokens that look
> roughly like text (these single-character things); those are all ill-formed
> already when transitioning to phase 7.
>
> One consistent view might be to allow everything as preprocessing tokens
> in phase 4
> (maybe concatenation and stringization will make those vanish before phase
> 7),
> and then do all the checking when transitioning to phase 7.
>
Delaying the validity check as much as possible also seems reasonable to me.
> However, I believe the rules were crafted this way to allow for extension
> points,
> e.g. for the situation when we want "real" math symbols as operators.
>
On a side note, how is that supposed to work for such "extension points"
when they consist of multiple code points?
For example, what if I wanted a prefix unary operator consisting of
REGIONAL INDICATOR SYMBOL LETTER D and REGIONAL INDICATOR SYMBOL LETTER E
(forming a German flag) that translates the following string literal
operand to German? It seems like
that extension is not possible because we match "each non-whitespace
character that cannot be one of the above" as one token. Maybe that use
case is a bit contrived, but it seems plausible that for these extensions
to work, you need a sequence of characters rather than a single character
to form a *preprocessing-token*. Aren't there some mathematical glyphs that
consist of more than one code point? However, that seems to open the door
to implementation-specific tokenization once more.
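
To make the single-character matching concrete, here is a minimal sketch of
how I read the current wording. It assumes the source is already a sequence
of code points and that the regional indicator symbols fall only into the
catch-all category. The names is_catch_all and leading_catch_all_tokens are
hypothetical and do not come from the standard or from any implementation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true for the regional indicator symbols
// U+1F1E6..U+1F1FF. For this sketch we assume they are neither whitespace
// nor part of any other preprocessing-token category.
constexpr bool is_catch_all(char32_t c) {
    return c >= U'\U0001F1E6' && c <= U'\U0001F1FF';
}

// Counts how many preprocessing tokens the catch-all category produces at
// the start of `src`. Under the "each non-whitespace character that cannot
// be one of the above" wording, every code point is its own token, so the
// count equals the number of leading catch-all code points.
constexpr std::size_t leading_catch_all_tokens(std::u32string_view src) {
    std::size_t count = 0;
    for (char32_t c : src) {
        if (!is_catch_all(c)) {
            break;  // the next token belongs to some other category
        }
        ++count;
    }
    return count;
}
```

Under these assumptions, passing U"\U0001F1E9\U0001F1EA\"Hallo\"" to
leading_catch_all_tokens gives 2: the flag is seen as two separate
preprocessing tokens in front of the string literal, not as one operator.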
> Slightly unrelated thought: Maybe we want a separate "pp-non-identifier"
> preprocessing
> token that contains the "allowed" single-characters (e.g. "@") explicitly,
> and then
> we can just bluntly say during lexing "any character that doesn't end up
> as part of
> a preprocessing token is ill-formed".
>
+1, that seems like an editorial improvement.
Received on 2026-06-26 08:59:10
