On Fri, Apr 10, 2020 at 3:29 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 10/04/2020 01.39, Hubert Tong wrote:
> On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>     On 09/04/2020 23.01, Hubert Tong wrote:
>     > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
>     >
>     >
>     >     See attached.
>     >
>     > We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we would need identifiers in NFC (barring implementation shortcuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.
>     I have trouble relating these comments to my document.
>     The idea of the pp-identifier introduction was to postpone
>     the NFC check until we actually need an /identifier/.
>     Are you saying that we need statements in [cpp.replace] p10 and p12
>     and [cpp.subst] p1 clarifying that we transition from pp-identifier
>     to identifier when identifying macro names and parameter names?
> Yes.

I've thought a little more about that.

There is clearly an /identifier/ in the preprocessor grammar when
defining these things, so we're good in that regard.

When we mention a macro or parameter in the program text and its
spelling is not the same as the /identifier/ in the definition
(for example, because of non-NFC), we simply don't replace,
and an attempt is made to turn the thing into an /identifier/
in phase 7, which subsequently fails.
There's no guarantee that it doesn't make its way into the result of stringization without triggering a diagnostic. Failing to replace something that would be a macro invocation if the pp-identifier were run through NFC normalization is closer to something the paper was supposed to prevent than to something the paper should endorse.
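
To make the "spelling is not the same" case concrete, here is a minimal sketch (not taken from the paper) of why a spelling-based match misses a non-NFC mention. It assumes UTF-8 code units and a plain code-unit comparison; the helper names decomposed, precomposed, and same_spelling are hypothetical. "A" followed by U+0300 and the precomposed U+00C0 render alike but are different code-unit sequences:

```cpp
#include <cassert>
#include <string>

// "A" followed by U+0300 COMBINING GRAVE ACCENT, in UTF-8 (not in NFC).
std::string decomposed() { return "A\xCC\x80"; }

// U+00C0 LATIN CAPITAL LETTER A WITH GRAVE, in UTF-8 (the NFC form).
std::string precomposed() { return "\xC3\x80"; }

// Hypothetical stand-in for matching a mention against a macro name
// purely by spelling, with no normalization step.
bool same_spelling(const std::string& a, const std::string& b) {
    return a == b;
}
```

Under these assumptions same_spelling(decomposed(), precomposed()) is false, so a macro named with the NFC form would not be found by a mention spelled in the decomposed form.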

Since we need to trial-match every pp-identifier we encounter
against defined macro names and macro parameters, we'd otherwise
do the transition during preprocessing all the time, which
is undesirable for token-pasting:

#define combine(X,Y) X ## Y
#define stringize(X) # X
char8_t * s = stringize(combine(A,\u0300));

When we rescan the intermediate macro-replacement result

for more macro replacement, A\u0300 shouldn't be ill-formed
right there.
I'm not convinced that there is sufficient motivation to allow this. I understand the motivation to side-step UB, but that does not require the rescan to be happy.
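
For reference, a sketch of how these two macros interact, using ordinary identifiers A and B in place of \u0300 so that it does not depend on how a given compiler treats a lone combining character. The extra macro stringize_expanded and the functions direct and indirect are hypothetical additions for illustration. Since an argument that is the operand of # is not macro-expanded before stringization, the pasted result only reaches the stringizer through one more level of indirection:

```cpp
#include <cassert>
#include <cstring>

#define combine(X,Y) X ## Y
#define stringize(X) # X
// Hypothetical helper: X is not an operand of # or ## here, so the
// argument is fully macro-expanded before stringize sees it.
#define stringize_expanded(X) stringize(X)

// The argument of stringize is stringized as spelled: "combine(A,B)".
const char* direct() { return stringize(combine(A,B)); }

// The argument is expanded first (A ## B gives AB), then stringized: "AB".
const char* indirect() { return stringize_expanded(combine(A,B)); }
```

So whether the rescan ever sees a pasted token such as A\u0300 depends on which of these two forms the user writes.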

Notes added.

>     That sounds reasonable.
>     >     Note that a universal-character-name never
>     >     represents a member of the basic source
>     >     character set, so we don't have to call out
>     >     underscores specifically.
>     >
>     >     This makes any sequence involving a universal-character-name
>     >     a pp-identifier (and thus a preprocessing-token), so that
>     >
>     >     #define accent(x) x ## \u0300
>     >
>     >     does the right thing.
>     >
>     > I'm fine with pp-identifier as a lexing tool, but I think allowing them to start with combining characters or to contain whitespace characters outside of the basic source character set is not advisable. That is, we should have a rule that validates pp-identifiers after determining the characters that it encompasses.
>     But if we don't make pp-identifier very accommodating,
>     #define accent(x) x ## \u0300
>     will never do anything useful, because \u0300 would not be a
>     preprocessing-token, so the preprocessing-token adjacent to ##
>     is just the backslash.
>     It's hard to formulate, in the presence of lexer max-munch,
>     when we continue lexing vs. we stop because some non-grammatical
>     restriction is no longer satisfied.  I was trying to avoid that
>     by making the transition pp-identifier -> identifier explicit.
> I understand the motivation, but I am having a hard time with formalizing "invisible" preprocessing tokens that don't "fail fast". These really are most interesting in conjunction with token pasting, so I think we're looking at either making the above pasting ill-formed (just because there is a \u0300 as a pp-identifier)

My guess is that this would be fine, in your view:

#define combine(X,Y) X ## Y
#define stringize(X) # X
char8_t * s = stringize(combine(A,\u0300));

So, we're only looking at a special rule that the pp-identifier lexically
following ## be an identifier, to avoid any source-code level "viewing in
editor" confusion.  But why is ## special in this regard, as opposed to
(say) the comma in

char8_t * s = stringize(combine(A,\u0300));

where similar confusion might arise?
I'm not happy with the comma either.

(This example should find its way into the paper as "ok", btw.)

>  or we are going to have that token pasting at the cost of needing to deal with pasting characters that are "invisible" or appear to modify the appearance "##" itself.

But that "dealing" is only a user-education issue, not a technical
specification issue.  Or am I missing something?
This is a design issue. This paper supposedly improves the situation around Unicode in source code, but it clarifies certain cases as having the "bad behaviour". That is, it "encourages" the lexing to consider operators and punctuators as unmodified by Unicode-related mechanisms.

I'm inclined to not modify the wording in this area.

>     >     Did someone check that UAX #31 really is part of ISO 10646?
>     >
>     > It isn't, but what we need a cross reference to from the non-Annex wording is the Unicode Character Database, which is also not part of ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual technical content--is down; they managed to keep the fluff site up...).
>     It seems we don't really need a normative reference to UAX #31;
>     all we need is a normative reference to the database and a
>     bibliography entry for UAX #31.
> Agreed.

www.unicode.org is still down, it seems.  :-(

Updated wording is attached.