Date: Thu, 27 Feb 2025 12:27:58 +0100
As a recap of [lex.phases]:
Phase 3 performs lexing: sequences of input characters are combined to
form low-level chunks called preprocessing tokens (and whitespace possibly
separating them). [lex.pptoken] is the root of the lexer grammar,
and any guidance to the lexer needs to be in terms of that subclause
("If the input stream has been parsed into preprocessing tokens up to a given character:")
So, at each position in the input character sequence, there is a decision on
whether the next character starts a new preprocessing token or not.
Note that there are very limited look-ahead capabilities, in practice.
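To illustrate with the plain maximal-munch behavior already in
[lex.pptoken]: the character sequence

  x+++++y

lexes into the preprocessing tokens x, ++, ++, +, y, because at each
position the longest possible preprocessing token is taken, even though
a different split could form a valid expression later on.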
Any rules for how to lex f-literals and x-literals (and their raw counterparts)
must fit that framework. (We don't need formal wording at this stage,
but we need to understand what the rules are.)
A reference to phase 7 ("expression") at this point is a serious impedance
mismatch and a layering violation; it must not appear in the actual
rule definitions (as opposed to the rationale for the lexer rules
as presented).
So, in short, if I have a sequence of characters like
f"{blah} f"x{huh}y" uh"
or
fR"{foo)} f"x{huh : {abc} }y" uh"
or
fR"{blah} R"x{huh}y" uh"
or
f"{f"{hi}"}"
what is the sequence of preprocessing tokens resulting from lexing each
of these instances? A useful technique might be to introduce artificial
non-utterable preprocessing tokens, similar to what we did for modules
with _import-keyword_ and friends.
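Purely as an illustration of that technique (with token names invented
here, not proposed), the trivial case

  f"{hi} x"

might come out of phase 3 as something like

  <f-literal-begin: f"{>  hi  <f-literal-end: } x">

where f-literal-begin and f-literal-end would be artificial,
non-utterable preprocessing tokens bracketing the interpolated
identifier hi.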
The paper should have copious examples for these and similar situations,
including the various colon disambiguation situations.
If the lexer now has to keep track of matching open/closing
parentheses (or other non-local state), that would be a rather novel
requirement, and should be highlighted for EWG consideration.
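For example, within the fragment f"x{huh : {abc} }y" from the second
sequence above, the lexer presumably has to recognize that the } after
abc closes the inner {abc} rather than the field opened after x, which
already amounts to counting braces.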
Phase 4 takes a sequence of preprocessing tokens (and whitespace
possibly separating them) and applies preprocessing, which is
essentially macro expansion and interpretation of directives.
If some such processing should not happen within certain sequences
of preprocessing tokens, this needs to be specified, and it has to be
expressed in terms of the "sequence of preprocessing tokens" output of phase 3.
Phases 5 and 6 concatenate string-literals, but might be extended
to concatenate other sequences of preprocessing tokens.
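For example, the adjacent string-literals in

  "abc" "def"

become the single string-literal "abcdef"; whether and how something like

  f"{x} a" "b"

takes part in such concatenation (and in which phase) would need to be
spelled out.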
Note that there is no a-priori notion of "nesting" (e.g. of parentheses)
before phase 7, although some special cases talk about "outside-most
matching parentheses".
We have messed with the preprocessor for modules, and it took
more than one iteration of rather extensive remodeling before
reaching the current specification state, which feels somewhat
stable. I do not want to be in that "repeated remodeling" situation
ever again; Richard Smith does not appear to be available for
such.
Jens