sg16: Re: [SG16] Wording for UAX #31 identifiers

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 22 Apr 2020 07:45:40 +0200

On 22/04/2020 04.38, Hubert Tong wrote:
> On Fri, Apr 10, 2020 at 3:29 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> When we mention a macro or parameter in the program text and its
> spelling is not the same as the /identifier/ in the definition
> (for example, because of non-NFC), we simply don't replace it,
> and an attempt is made to turn it into an /identifier/
> in phase 7, which then fails.
>
> There's no guarantee that it doesn't make its way into the result of stringization without triggering a diagnostic.

Right. But the contents of strings are not required to be in NFC, so there is
no (obvious) problem.

> The failure to replace something that would be an invocation of a macro if the pp-identifier were run through an NFC normalization process is closer to something that the paper was supposed to prevent than something that the paper should endorse.

If we want to go there (to be discussed today), then we should lex
pp-identifier as currently specified in the paper, but immediately check
for NFC afterwards.

This does address lone combining marks, but does not discuss
other situations where a valid pp-identifier is not an
identifier for reasons of violating the XID_Start / XID_Continue
rules. Do you have an opinion on that?
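
To make the "lex first, then check for NFC" idea concrete, here is a minimal sketch. It is not a real NFC check: the table kComposable and the function looks_nfc are hypothetical names, and the table lists only two base/mark pairs that have a precomposed form. A real implementation would follow UAX #15 or use a Unicode library.

```cpp
#include <cstddef>
#include <string_view>

// Toy stand-in for a real NFC check (assumption: only these two pairs matter).
struct ComposablePair { char32_t base; char32_t mark; };
const ComposablePair kComposable[] = {
    {U'A', U'\u0300'},  // NFC composes this pair to U+00C0
    {U'e', U'\u0301'},  // NFC composes this pair to U+00E9
};

// Returns false if the already-lexed pp-identifier spelling contains a
// base/mark pair from the toy table, i.e. a pair NFC would have composed.
bool looks_nfc(std::u32string_view spelling) {
    for (std::size_t i = 0; i + 1 < spelling.size(); ++i) {
        for (const ComposablePair& p : kComposable) {
            if (spelling[i] == p.base && spelling[i + 1] == p.mark) {
                return false;
            }
        }
    }
    return true;
}
```

With this, the spelling A\u0300 would be rejected right after lexing, while the precomposed \u00C0 would pass.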

> Since we need to trial-match every pp-identifier we encounter
> against defined macro names and macro parameters, we'd otherwise
> do the transition during preprocessing all the time, which
> is undesirable for token-pasting:
>
> #define combine(X,Y) X ## Y
> #define stringize(X) # X
> char8_t * s = stringize(combine(A,\u0300));
>
> When we rescan the intermediate macro-replacement result
> stringize(A\u0300)
>
> for more macro replacement, A\u0300 shouldn't be ill-formed
> right there.
>
> I'm not convinced that there is sufficient motivation to allow this. I understand the motivation to side-step UB, but that does not require the rescan to be happy.

Let's see what happens later today.

> > or we are going to have that token pasting at the cost of needing to deal with pasting characters that are "invisible" or appear to modify the appearance "##" itself.
>
> But that "dealing" is only a user-education issue, not a technical
> specification issue. Or am I missing something?
>
> This is a design issue.

Sure.

> This paper supposedly improves the situation around Unicode in source code, but it clarifies certain cases as having the "bad behaviour". That is, it "encourages" the lexing to treat operators and punctuators as unmodified by Unicode-related mechanisms.

Yes:

+\u0300

is one operator-or-punctuator and one pp-identifier during lexing.
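
A toy lexer makes the split visible. This is only a sketch under loud assumptions: is_pp_identifier_char is a hypothetical rule that accepts ASCII letters, digits, '_' and every non-ASCII code point, which is not the paper's actual grammar, and pp-numbers and multi-character operators are ignored.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

enum class Kind { Punctuator, PpIdentifier };
struct Token { Kind kind; std::u32string spelling; };

// Assumption for this sketch only: any non-ASCII code point (including a
// combining mark) may start or continue a pp-identifier.
bool is_pp_identifier_char(char32_t c) {
    return c >= 0x80 || c == U'_' || (c >= U'0' && c <= U'9') ||
           (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

std::vector<Token> lex(std::u32string_view text) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == U' ') {  // skip spaces between tokens
            ++i;
        } else if (is_pp_identifier_char(text[i])) {
            const std::size_t start = i;
            while (i < text.size() && is_pp_identifier_char(text[i])) ++i;
            out.push_back({Kind::PpIdentifier,
                           std::u32string(text.substr(start, i - start))});
        } else {
            // Everything else is treated as a one-character punctuator.
            out.push_back({Kind::Punctuator, std::u32string(1, text[i])});
            ++i;
        }
    }
    return out;
}
```

Calling lex(U"+\u0300") yields a punctuator "+" followed by a pp-identifier consisting solely of U+0300, even though a reader sees a single modified plus sign.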

I guess the summary here is "Unicode considers combining marks as
sticking to the preceding character. It's a bad design for C++
to break this up during lexing, without giving a diagnostic when
doing so."

Jens

Received on 2020-04-22 00:48:40