On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 09/04/2020 23.01, Hubert Tong wrote:
> On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
>
>
> See attached.
>
> We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we would need identifiers in NFC (barring implementation shortcuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.

I have trouble relating these comments to my document.
The idea of the pp-identifier introduction was to postpone
the NFC check until we actually need an /identifier/.

Are you saying that we need statements in [cpp.replace] p10 and p12
and [cpp.subst] p1 clarifying that we transition from pp-identifier
to identifier when identifying macro names and parameter names?

Yes.

That sounds reasonable.
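
To make the "transition at lookup" idea concrete, here is a minimal sketch, not proposed wording and not any compiler's implementation. MacroTable, lookup and is_valid_identifier are hypothetical names; the predicate is only a stand-in for whatever NFC and identifier-character checks the final wording would require. The point is that the check runs when a pp-identifier is about to be used as a macro name, not when it is lexed.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical model of macro-name lookup. A pp-identifier spelling is
// validated only at the moment it has to act as an identifier.
struct MacroTable {
    // Macro name -> replacement text (object-like macros only, for brevity).
    std::map<std::string, std::string> macros;

    // Stand-in for the real checks (NFC, allowed start/continue characters).
    // Assumption: the caller supplies it; nothing here consults Unicode data.
    std::function<bool(const std::string&)> is_valid_identifier;

    // Returns the replacement text if `spelling` names a macro, or nullopt
    // if it is a valid identifier that is not a macro name.
    // Throws if the spelling cannot become an identifier at all.
    std::optional<std::string> lookup(const std::string& spelling) const {
        if (!is_valid_identifier(spelling))
            throw std::invalid_argument(
                "pp-identifier is not a valid identifier: " + spelling);
        auto it = macros.find(spelling);
        if (it == macros.end()) return std::nullopt;
        return it->second;
    }
};
```

Under this model an operand of ## is never passed to lookup on its own, so a lone \u0300 survives until the pasted result is rescanned.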

> Note that a universal-character-name never
> represents a member of the basic source
> character set, so we don't have to call out
> underscores specifically.
>
> This makes any sequence involving a universal-character-name
> a pp-identifier (and thus a preprocessing-token), so that
>
> #define accent(x) x ## \u0300
>
> does the right thing.
>
> I'm fine with pp-identifier as a lexing tool, but I think allowing one to start with a combining character or to contain whitespace characters outside of the basic source character set is not advisable. That is, we should have a rule that validates a pp-identifier after determining the characters that it encompasses.

But if we don't make pp-identifier very accommodating,

#define accent(x) x ## \u0300

will never do anything useful, because \u0300 would not be a
preprocessing-token, so the preprocessing-token adjacent to ##
is just the backslash.
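
As a rough illustration of the accommodating rule, the sketch below lexes a pp-identifier by max-munch and accepts a universal-character-name at any position, including the first. The function names are hypothetical and the grammar is my reading of this thread, not the wording in the attached document. Under this rule \u0300 on its own is a single six-character token, so ## has something to paste.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Length of a universal-character-name (\uXXXX or \UXXXXXXXX) starting at
// src[pos], or 0 if there is none. The UCN is always taken as a whole.
inline std::size_t ucn_length(std::string_view src, std::size_t pos) {
    if (pos + 1 >= src.size() || src[pos] != '\\') return 0;
    std::size_t digits = src[pos + 1] == 'u' ? 4 : src[pos + 1] == 'U' ? 8 : 0;
    if (digits == 0 || pos + 2 + digits > src.size()) return 0;
    for (std::size_t i = 0; i < digits; ++i)
        if (!std::isxdigit(static_cast<unsigned char>(src[pos + 2 + i])))
            return 0;
    return 2 + digits;
}

inline bool is_basic_nondigit(char c) {
    return c == '_' || std::isalpha(static_cast<unsigned char>(c));
}

// Hypothetical permissive rule: a pp-identifier starts with a basic nondigit
// or ANY UCN, and continues with nondigits, digits, or UCNs (longest match).
// Returns the number of characters consumed, 0 if no pp-identifier starts here.
inline std::size_t lex_pp_identifier(std::string_view src, std::size_t pos) {
    assert(pos <= src.size());
    std::size_t i = pos;
    while (i < src.size()) {
        if (std::size_t n = ucn_length(src, i)) { i += n; continue; }
        char c = src[i];
        bool digit = std::isdigit(static_cast<unsigned char>(c)) != 0;
        if (is_basic_nondigit(c) || (digit && i != pos)) { ++i; continue; }
        break;
    }
    return i - pos;
}
```

No validity check happens here; whether the resulting spelling is an acceptable identifier is decided later, which is the explicit pp-identifier -> identifier transition.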

It's hard to formulate, in the presence of lexer max-munch,
when we continue lexing vs. when we stop because some non-grammatical
restriction is no longer satisfied. I was trying to avoid that
by making the transition pp-identifier -> identifier explicit.

I understand the motivation, but I am having a hard time with formalizing "invisible" preprocessing tokens that don't "fail fast". These really are most interesting in conjunction with token pasting, so I think we're looking at either making the above pasting ill-formed (just because there is a \u0300 as a pp-identifier) or keeping that token pasting at the cost of needing to deal with pasting characters that are "invisible" or appear to modify the appearance of the "##" itself.

Note that all non-basic characters are represented as
universal-character-names at that point (whether they
appeared as such in the original source code or not).
I think we should never break up \uXXXX during lexing or
preprocessing.
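
For readers following along, the mapping of extended characters to universal-character-names can be sketched as below. to_ucn_spelling is a hypothetical helper, and treating every code point below 0x80 as "basic" is a simplifying assumption (the basic source character set is actually a subset of ASCII).

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Rewrites each non-ASCII code point as \uXXXX (or \UXXXXXXXX above U+FFFF),
// so later phases only ever see basic characters and whole UCNs.
// Assumption: all code points below 0x80 pass through unchanged.
inline std::string to_ucn_spelling(std::u32string_view text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
            continue;
        }
        char buf[11];  // "\U" + 8 hex digits + terminating NUL
        if (cp <= 0xFFFF)
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        else
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        out += buf;
    }
    return out;
}
```

For example, an "e" followed by U+0301 becomes the seven characters e\u0301, and the lexer then has to treat \u0301 as one unit.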

I agree we should never break up \uXXXX or \UxxxxYYYY.

> Did someone check that UAX #31 really is part of ISO 10646?
>
> It isn't, but what the non-Annex wording needs a cross reference to is the Unicode Character Database, which is also not part of ISO/IEC 10646 (I can't check, though: unicode.org, the one with the actual technical content, is down; they managed to keep the fluff site up...).

It seems we don't really need a normative reference to UAX #31;
all we need is a normative reference to the database and a
bibliography entry for UAX #31.

Agreed.