On Wed, Apr 22, 2020 at 1:45 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 22/04/2020 04.38, Hubert Tong wrote:
> On Fri, Apr 10, 2020 at 3:29 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:

>     When we mention a macro or parameter in the program text and its
>     spelling is not the same as the /identifier/ in the definition
>     (for example, because of non-NFC), we simply don't replace,
>     and an attempt is made to turn the thing into an /identifier/
>     in phase 7, with subsequent failure.
>
> There's no guarantee that it doesn't make its way into the result of stringization without triggering a diagnostic.

Right.  But the contents of strings are not required to be in NFC, so there is
no (obvious) problem.
The problem is that a user may not get the string they want. That is:


#define à Hello!
#define STR2( X )  # X
#define STR( X )  STR2(X)
const char *msg = STR(à); // encoded as a{U+0300}

does not give "Hello!" but gives "a\u0300".

>   The failure to replace something that would be an invocation of a macro if the pp-identifier were run through NFC normalization is closer to something that the paper was supposed to prevent than something that the paper should endorse.

If we want to go there (to be discussed today), then we should lex
pp-identifier as currently specified in the paper, but immediately check
for NFC afterwards.

This does address lone combining marks, but does not discuss
other situations where a valid pp-identifier is not an
identifier for reasons of violating the XID_Start / XID_Continue
rules.  Do you have an opinion on that?
I believe that allowing these pp-tokens to survive longer flies in the face of the recommendations that we are trying to adopt. Violations of XID_Start/XID_Continue leave us with compatibility issues for future language evolution, in the context of adopting Unicode characters for use as operators, punctuators, or whitespace.

>     Since we need to trial-match every pp-identifier we encounter
>     against defined macro names and macro parameters, we'd otherwise
>     do the transition during preprocessing all the time, which
>     is undesirable for token-pasting:
>
>     #define combine(X,Y) X ## Y
>     #define stringize(X) # X
>     const char * s = stringize(combine(A,\u0300));
>
>     When we rescan the intermediate macro-replacement result
>       stringize(A\u0300)
>
>     for more macro replacement, A\u0300 shouldn't be ill-formed
>     right there.
>
> I'm not convinced that there is sufficient motivation to allow this. I understand the motivation to side-step UB, but that does not require the rescan to be happy.

Let's see what happens later today.


>     >  or we are going to have that token pasting at the cost of needing to deal with pasting characters that are "invisible" or that appear to modify the appearance of "##" itself.
>
>     But that "dealing" is only a user-education issue, not a technical
>     specification issue.  Or am I missing something?
>
> This is a design issue.

Sure.

> This paper supposedly improves the situation around Unicode in source code, but it specifies certain cases as having the "bad behaviour". That is, it "encourages" the lexing to treat operators and punctuators as unmodified by Unicode-related mechanisms.

Yes:

+\u0300

is one operator-or-punctuator and one pp-identifier during lexing.

I guess the summary here is "Unicode considers combining marks as
sticking to the preceding character. It's a bad design for C++
to break this up during lexing, without giving a diagnostic when
doing so."
Yes, I believe so.

Jens