We currently say:

If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in the basic character set matches the last category, the program is ill-formed.

But no one implements that restriction: https://compiler-explorer.com/z/fe9jWcE1G

In non-pedantic mode, GCC accepts virtually anything as an identifier. In pedantic mode, it treats any character outside the basic character set as part of an identifier, and then rejects the identifier if it does not meet the XID requirements.

Besides that oddity, no one rejects nonsensical pp-tokens.
The restriction itself is not consistent.

Either we want to allow codepoints to appear in source even if they are not part of grammar elements... or we do not.
Whether they are part of the basic character set should have no impact.


So either we want to say:
If any character matches the last category, the program is ill-formed.
or:
If a U+0027 apostrophe or a U+0022 quotation mark matches the last category, the program is ill-formed.

It seems to me that the use case for allowing these non-token tokens to exist is to allow concatenation shenanigans.
But you can do token concatenation shenanigans with characters outside the basic character set too:

#define __CONCAT(A,B) A ## B
#define CONCAT(A,B) __CONCAT(A, B)
#define ONE ፩  // not an identifier (xid_start = false, xid_continue = true)
int CONCAT(A, ONE);

Is that useful? Maybe not. Is that less useful than the shenanigans that _are_ allowed? Also no!
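
For comparison, here is a minimal sketch of the kind of thing that is tolerated today: a lone character that is not any real token, consumed by the preprocessor before it would ever need to become one. The macro STR and the function at_sign are hypothetical names chosen for illustration, and '@' is just one convenient choice of character; whether it is in the basic character set depends on which revision of the standard you read. This assumes a compiler that, like the ones in the link above, does not enforce the restriction.

```cpp
#include <cassert>
#include <cstring>

// STR (hypothetical helper) stringizes its argument as written.
#define STR(x) #x

// '@' on its own is not an identifier, a number, a literal or a
// punctuator, so it is lexed as a lone-character pp-token (the "last
// category"). Stringizing turns it into a string literal before it
// would have to be converted to a real token.
inline const char* at_sign() { return STR(@); }
```

Calling at_sign() should hand back a one-character string containing '@', even though a bare @ written directly in the program text would be rejected once it reached the parser.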


I'm curious what people think.

Cheers