sg16: Re: [SG16] Wording for UAX #31 identifiers

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Wed, 22 Apr 2020 11:44:20 -0400

On Wed, Apr 22, 2020 at 1:45 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 22/04/2020 04.38, Hubert Tong wrote:
> > On Fri, Apr 10, 2020 at 3:29 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> > When we mention a macro or parameter in the program text and its
> > spelling is not the same as the /identifier/ in the definition
> > (for example, because of non-NFC), we simply don't replace,
> > and the thing is (attempted) to be turned into an /identifier/
> > in phase 7, with subsequent failure.
> >
> > There's no guarantee that it doesn't make its way into the result of
> > stringization without triggering a diagnostic.
>
> Right. But the contents of strings are not required to be in NFC, so there
> is no (obvious) problem.
>
The problem is that a user may not get the string they want. That is:

Online compiler link
<https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:1,lang:c%2B%2B,selection:(endColumn:27,endLineNumber:4,positionColumn:27,positionLineNumber:4,selectionStartColumn:27,selectionStartLineNumber:4,startColumn:27,startLineNumber:4),source:'%23define+%C3%A0+Hello!!%0A%23define+STR2(+X+)++%23+X%0A%23define+STR(+X+)++STR2(X)%0Aconst+char+*msg+%3D+STR(a%CC%80)%3B'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:33.333333333333336,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:gsnapshot,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:1,lang:c%2B%2B,libs:!(),options:'',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+gcc+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:33.333333333333336,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compiler:1,editor:1,fontScale:14,wrap:'1'),l:'5',n:'0',o:'%231+with+x86-64+gcc+(trunk)',t:'0')),k:33.33333333333333,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4>

#define à Hello!          // macro name spelled as precomposed U+00E0
#define STR2( X ) # X
#define STR( X ) STR2(X)
const char *msg = STR(à); // encoded as a{U+0300}

does not give "Hello!" but gives "a\u0300", because the decomposed spelling
in the invocation does not match the macro name.
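
To make the mismatch concrete, here is a minimal sketch of macro-name lookup by exact spelling. It is not how any particular compiler is implemented; `MacroTable`, `lookup_macro`, and `example_table` are hypothetical names introduced only for illustration, and the table is keyed by code points so that the two spellings can be compared without any encoding assumptions.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical macro table: exact code-point spelling of the name -> body.
using MacroTable = std::map<std::u32string, std::string>;

// Look up a pp-identifier by its exact spelling. No normalization is
// applied, so canonically equivalent spellings are distinct keys.
std::optional<std::string> lookup_macro(const MacroTable& table,
                                        const std::u32string& spelling) {
    assert(!spelling.empty());  // a pp-identifier is never empty
    const auto it = table.find(spelling);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// The single macro from the example above: name U+00E0, body Hello!
MacroTable example_table() {
    MacroTable table;
    table[U"\u00E0"] = "Hello!";
    return table;
}
```

Looking up `U"\u00E0"` in `example_table()` finds the body, while looking up `U"a\u0300"` finds nothing, so the decomposed spelling is left unreplaced and reaches stringization as-is.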

>
> > The failure to replace something that would be an invocation of a
> > macro if the pp-identifier was run through an NFC normalization process is
> > closer to something that the paper was supposed to prevent than something
> > that the paper should endorse.
>
> If we want to go there (to be discussed today), then we should lex
> pp-identifier as currently specified in the paper, but immediately check
> for NFC afterwards.
>
> This does address lone combining marks, but does not discuss
> other situations where a valid pp-identifier is not an
> identifier for reasons of violating the XID_Start / XID_Continue
> rules. Do you have an opinion on that?
>
I believe that allowing these pp-tokens to last longer flies in the face of
the recommendations that we are trying to adopt. Violations of
XID_Start/XID_Continue leave us with compatibility-in-language-evolution
issues in the context of adopting Unicode symbols that are acceptable for
use as operators, punctuators, or whitespace.

>
> > Since we need to trial-match every pp-identifier we encounter
> > against defined macro names and macro parameters, we'd otherwise
> > do the transition during preprocessing all the time, which
> > is undesirable for token-pasting:
> >
> > #define combine(X,Y) X ## Y
> > #define stringize(X) # X
> > char8_t * s = stringize(combine(A,\u0300));
> >
> > When we rescan the intermediate macro-replacement result
> > stringize(A\u0300)
> >
> > for more macro replacement, A\u0300 shouldn't be ill-formed
> > right there.
> >
> > I'm not convinced that there is sufficient motivation to allow this. I
> > understand the motivation to side-step UB, but that does not require the
> > rescan to be happy.
>
> Let's see what happens later today.
>
>
> > > or we are going to have that token pasting at the cost of needing
> > > to deal with pasting characters that are "invisible" or appear to modify
> > > the appearance of "##" itself.
> >
> > But that "dealing" is only a user-education issue, not a technical
> > specification issue. Or am I missing something?
> >
> > This is a design issue.
>
> Sure.
>
> > This paper supposedly improves the situation around Unicode in source
> > code, but it clarifies certain cases as having the "bad behaviour". That
> > is, it "encourages" the lexing to consider operators and punctuators as
> > unmodified by Unicode-related mechanisms.
>
> Yes
>
> +\u0300
>
> is one operator-or-punctuator and one pp-identifier during lexing.
>
> I guess the summary here is "Unicode considers combining marks as
> sticking to the preceding character. It's a bad design for C++
> to break this up during lexing, without giving a diagnostic when
> doing so."
>
Yes, I believe so.
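
As a rough illustration of the kind of diagnostic being discussed, the sketch below flags a combining mark that lexing would split from the character it visually attaches to. It rests on loudly simplified assumptions: only the Combining Diacritical Marks block (U+0300..U+036F) counts as "combining", and only ASCII letters, digits, and `_` count as identifier characters. The names `is_combining_mark`, `is_identifier_char`, and `first_detached_mark` are hypothetical, and a real check would consult the Unicode character properties instead.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Assumption: only U+0300..U+036F (Combining Diacritical Marks) is covered.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Assumption: ASCII letters, digits, and '_' stand in for identifier
// characters; the real rule would use XID_Start / XID_Continue.
bool is_identifier_char(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= U'0' && c <= U'9') || c == U'_';
}

// Return the index of the first combining mark that either starts the text
// or directly follows an ASCII character that cannot be part of an
// identifier (an operator, punctuator, or whitespace), if there is one.
std::optional<std::size_t> first_detached_mark(const std::u32string& text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (!is_combining_mark(text[i])) continue;
        if (i == 0) return i;  // lone combining mark at the very start
        const char32_t prev = text[i - 1];
        if (prev < 0x80 && !is_identifier_char(prev)) return i;
    }
    return std::nullopt;
}
```

With these assumptions, `U"+\u0300"` is flagged at index 1, whereas `U"A\u0300"` is not flagged because the mark follows an identifier character.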

>
> Jens
>

Received on 2020-04-22 10:47:34