Date: Thu, 25 Jun 2026 11:51:02 +0200
We say that
> If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in
> the basic character set matches the last category, the program is
> ill-formed.
But no one implements that restriction:
https://compiler-explorer.com/z/fe9jWcE1G
In non-pedantic mode, GCC accepts virtually anything as an identifier. In
pedantic mode, it treats anything outside of the basic character set as
part of an identifier, and then rejects the identifier if it does not
follow the XID requirements.
Besides that oddity, no one rejects nonsensical pp-tokens.
The restriction itself is not consistent.
Either we want to allow codepoints to appear in source even if they are not
part of grammar elements... or we do not.
Whether they are part of the basic character set should have no impact.
So either we want to say
> *If any character matches the last category, the program is ill-formed.*
or
> *If a U+0027 apostrophe or a U+0022 quotation mark matches the last
> category, the program is ill-formed.*
It seems to me that the use case for allowing these non-token pp-tokens to
exist is to allow concatenation shenanigans.
But you can do token concatenation shenanigans with characters outside the
basic character set too:
#define __CONCAT(A,B) A ## B
#define CONCAT(A,B) __CONCAT(A, B)
#define ONE ፩ // not an identifier (xid_start = false, xid_continue = true)
int CONCAT(A, ONE);
Is that useful? Maybe not. Is that less useful than the shenanigans that
*_are_* allowed? Also no!
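For comparison, here is a minimal sketch of the kind of shenanigan that is
allowed. It assumes a C++26 basic character set that includes `@` (that is
my assumption for this example, and STRINGIZE and kAt are hypothetical
names). A lone `@` matches the last pp-token category, but it is stringized
before it would ever have to become a token:

```cpp
#include <string_view>

// Hypothetical helper macro: turns its argument into a string literal.
#define STRINGIZE(x) #x

// '@' on its own is not an identifier, a literal, or an operator, so it
// lexes as a pp-token of the last category. Stringizing it yields an
// ordinary string literal before pp-tokens are converted to tokens.
constexpr const char* kAt = STRINGIZE(@);

// Compile-time check that the stray character survived as text.
static_assert(std::string_view(kAt) == "@");
```

Swap the `@` for a character outside the basic character set and the same
code falls under the quoted rule, even though nothing else about it changes.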
I'm curious what people think.
Cheers
Received on 2026-06-25 09:51:26
