Re: [isocpp-sg16] Simplifying [lex.pptoken]/p1

From: Matthias Wippich <mfwippich_at_[hidden]>
Date: Fri, 26 Jun 2026 13:21:33 +0200
> I'm also interested in how this would interact with $identifier. All I really care about is that tokenization stays fairly consistent across implementations

Similar to removing $ from the basic character set, that reverts the
status to "implementations can do whatever they want". $identifier would
become a conforming extension again, but we would give absolutely zero
guarantees about it. In practice this is similar to option 1 (since
that's what most implementations currently implement), but I believe
it'd give us even fewer guarantees than option 1 would.

The only truly portable option that actually removes
implementation-specific behavior is "allow $idents everywhere
unconditionally", which isn't currently proposed by the paper. I'll
add that, since the unimplementability claim doesn't really seem to
hold in $current_year (see SG16 mattermost). Aside from that, only
option 2 and option 3 give you consistent preprocessing token
boundaries, since $ is added to nondigit for both. Option 1, removing
$ from the basic character set, and messing with [lex.pptoken]/1 do
not do that.
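
To make "consistent preprocessing token boundaries" concrete, here is a
minimal sketch of how the boundary can be observed from within a program.
The names STR, STR_IMPL, PROBE and dollar_tokenization are made up for
illustration and do not come from the paper. The result is
implementation-dependent on purpose, which is the whole point.

```cpp
#include <string>

// Two-level stringization: the argument of STR is macro-expanded
// before STR_IMPL turns it into a string literal.
#define STR_IMPL(x) #x
#define STR(x) STR_IMPL(x)

// Hypothetical probe macro. It is only expanded if PROBE is lexed
// as a preprocessing token of its own.
#define PROBE 1

// If `PROBE$x` is lexed as one identifier, no macro named `PROBE$x`
// exists, nothing is expanded, and the result spells "PROBE$x".
// If it is lexed as `PROBE`, `$`, `x`, the probe is expanded and the
// result starts with "1" instead.
inline std::string dollar_tokenization() {
    return STR(PROBE$x);
}
```

Printing dollar_tokenization() on two implementations (or on one
implementation with and without its $-in-identifiers extension) shows
whether they agree on where the token boundaries are.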


Best,
Matthias

On Fri, Jun 26, 2026 at 10:59 AM Jan Schultke via SG16
<sg16_at_[hidden]> wrote:
>
>
>
> On Thu, 25 Jun 2026 at 16:20, Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>>
>>
>>
>> On 6/25/26 11:51, Corentin via SG16 wrote:
>> > We say that
>> >
>> > If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in the basic character set matches the last category, the program is ill-formed.
>> >
>> >
>> > But no one implements that restriction: https://compiler-explorer.com/z/fe9jWcE1G
>> >
>> > In non-pedantic mode, GCC accepts virtually anything as an identifier, and in pedantic mode they consider that anything outside of the basic character set is part of an identifier - and then reject them if they do not follow the XID requirements.
>> >
>> > Besides that oddity, no one rejects nonsensical pp-tokens.
>> > The restriction itself is not consistent.
>> >
>> > Either we want to allow codepoints to appear in source even if they are not part of grammar elements... or we do not.
>> > Whether they are part of the basic character set should have no impact.
>> >
>> >
>> > So either we want to say
>> >> *If any character matches the last category, the program is ill-formed.*
>>
>> Sounds plausible to me, but maybe causes more friction for
>> "$identifier" and similar?
>
>
> I'm also interested in how this would interact with $identifier. All I really care about is that tokenization stays fairly consistent across implementations, and making $ ill-formed per [lex.pptoken] paragraph 1 seems to move away from that.
>
> If we make the change allowing $identifier first and require that to be a preprocessing-token consistently, then adjusting the rule like Corentin suggested seems fine.
>
>> >> *If a U+0027 apostrophe or a U+0022 quotation mark matches the last category, the program is ill-formed.*
>> >
>> > It seems to me that the use case for allowing these non-token tokens to exist is to allow concatenation shenanigans.
>> > But you can do token concatenation shenanigans with characters outside the basic character set, too:
>> >
>> > #define __CONCAT(A,B) A ## B
>> > #define CONCAT(A,B) __CONCAT(A, B)
>> > #define ONE ፩ // not an identifier (xid_start = false, xid_continue = true)
>> > int CONCAT(A, ONE);
>> >
>> > Is that useful? Maybe not. Is that less useful than the shenanigans that *are* allowed? Also no!
>>
>> While we discussed "$identifier", you said you didn't want phase-4 identifiers
>> and phase-7 identifiers to be under different rules.
>> Only in phase 4 do we have non-identifier preprocessing tokens that look
>> roughly like text (these single-character things); those are all ill-formed
>> already when transitioning to phase 7.
>>
>> One consistent view might be to allow everything as preprocessing tokens in phase 4
>> (maybe concatenation and stringization will make those vanish before phase 7),
>> and then do all the checking when transitioning to phase 7.
>
>
> Delaying the validity check as much as possible also seems reasonable to me.
>
>> However, I believe the rules were crafted this way to allow for extension points,
>> e.g. for the situation when we want "real" math symbols as operators.
>
>
> On a side note, how is that supposed to work for such "extension points" when they consist of multiple code points?
>
> For example, what if I wanted a prefix unary operator, consisting of REGIONAL INDICATOR D and REGIONAL INDICATOR E (forming a German flag), that translates the following string literal operand to German? It seems like that extension is not possible because we match "each non-whitespace character that cannot be one of the above" as one token. Maybe that use case is a bit contrived, but it seems plausible that for these extensions to work, you need a sequence of characters rather than a single character to form a preprocessing-token. Aren't there some mathematical glyphs that consist of more than one code point? However, that seems to open the door to implementation-specific tokenization once more.
>
>>
>> Slightly unrelated thought: Maybe we want a separate "pp-non-identifier" preprocessing
>> token that contains the "allowed" single-characters (e.g. "@") explicitly, and then
>> we can just bluntly say during lexing "any character that doesn't end up as part of
>> a preprocessing token is ill-formed".
>
>
> +1, that seems like an editorial improvement.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> Link to this post: http://lists.isocpp.org/sg16/2026/06/4773.php

Received on 2026-06-26 11:21:47