sg16: Re: [SG16] Wording for UAX #31 identifiers

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 21 Apr 2020 22:38:07 -0400

On Fri, Apr 10, 2020 at 3:29 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 10/04/2020 01.39, Hubert Tong wrote:
> > On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 09/04/2020 23.01, Hubert Tong wrote:
> > > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> > >
> > >
> > > See attached.
> > >
> > > We can only keep preprocessor tokens in the pp-identifier space for
> > > so long. Any time scanning for object-like macro names or
> > > function-like macro invocations occurs, we would need identifiers in
> > > NFC (barring implementation short cuts). The same applies to finding
> > > parameters in the replacement list of a function-like macro. Shifting
> > > the place where requirements are checked is sufficient to avoid token
> > > concatenation UB/behavioural surprises.
> >
> > I have trouble relating these comments to my document.
> > The idea of the pp-identifier introduction was to postpone
> > the NFC check until we actually need an /identifier/.
> >
> > Are you saying that we need statements in [cpp.replace] p10 and p12
> > and [cpp.subst] p1 clarifying that we transition from pp-identifier
> > to identifier when identifying macro names and parameter names?
> >
> > Yes.
>
> I've thought a little more about that.
>
> There is clearly an /identifier/ in the preprocessor grammar when
> defining these things, so we're good in that regard.
>
> When we mention a macro or parameter in the program text and its
> spelling is not the same as the /identifier/ in the definition
> (for example, because of non-NFC), we simply don't replace,
> and an attempt is made to turn the thing into an /identifier/
> in phase 7, with subsequent failure.
>
There's no guarantee that it doesn't make its way into the result of
stringization without triggering a diagnostic. The failure to replace
something that would be a macro invocation if the pp-identifier were run
through NFC normalization is closer to something that the paper was
supposed to prevent than to something that the paper should endorse.
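
To make the lookup concern concrete, here is a minimal sketch, assuming a
macro table that compares names by exact spelling. MacroTable, names_macro,
and check_lookup are hypothetical names for illustration and do not describe
any particular implementation. The two strings are the UTF-8 encodings of
U+00C0 and of U+0041 U+0300, which are canonically equivalent but spelled
differently.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical macro table: names are stored and compared by exact spelling.
using MacroTable = std::set<std::string>;

inline bool names_macro(const MacroTable& table, const std::string& spelling) {
    return table.count(spelling) != 0;
}

// UTF-8 spellings of two canonically equivalent sequences.
const std::string kPrecomposed = "\xC3\x80";   // U+00C0 (already in NFC)
const std::string kDecomposed  = "A\xCC\x80";  // U+0041 U+0300 (not NFC)

// A macro defined under the NFC spelling is not found under the
// decomposed spelling unless the lookup normalizes first.
inline void check_lookup() {
    MacroTable table{kPrecomposed};
    assert(names_macro(table, kPrecomposed));
    assert(!names_macro(table, kDecomposed));
}
```

With such a lookup, the decomposed spelling is silently left unreplaced
rather than diagnosed at the point of the lookup.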

>
> Since we need to trial-match every pp-identifier we encounter
> against defined macro names and macro parameters, we'd otherwise
> do the transition during preprocessing all the time, which
> is undesirable for token-pasting:
>
> #define combine(X,Y) X ## Y
> #define stringize(X) # X
> char8_t * s = stringize(combine(A,\u0300));
>
> When we rescan the intermediate macro-replacement result
> stringize(A\u0300)
>
> for more macro replacement, A\u0300 shouldn't be ill-formed
> right there.
>
I'm not convinced that there is sufficient motivation to allow this. I
understand the motivation to side-step UB, but that does not require the
rescan to be happy.
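
For reference, an ASCII-only sketch of how a pasted token reaches
stringization. combine and stringize are the macros from the example above;
xstringize and same_spelling are hypothetical helpers added here. An
argument that is the operand of # is not macro-expanded first, so the extra
level of indirection is what lets the pasted result be the thing that gets
stringized.

```cpp
#include <cstring>

#define combine(X,Y) X ## Y
#define stringize(X) # X
// Hypothetical helper: X is fully macro-expanded before stringize sees it.
#define xstringize(X) stringize(X)

// The operand of # is taken as spelled, so combine is not expanded here.
const char* const direct = stringize(combine(A,B));

// Here combine(A,B) is expanded to AB first, and AB is then stringized.
const char* const indirect = xstringize(combine(A,B));

inline bool same_spelling(const char* lhs, const char* rhs) {
    return std::strcmp(lhs, rhs) == 0;
}
```

In this sketch direct is "combine(A,B)" and indirect is "AB".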

>
> Notes added.
>
> > That sounds reasonable.
> >
> >
> >
> >
> >
> > > Note that a universal-character-name never
> > > represents a member of the basic source
> > > character set, so we don't have to call out
> > > underscores specifically.
> > >
> > > This makes any sequence involving a universal-character-name
> > > a pp-identifier (and thus a preprocessing-token), so that
> > >
> > > #define accent(x) x ## \u0300
> > >
> > > does the right thing.
> > >
> > > I'm fine with pp-identifier as a lexing tool, but I think allowing
> > > them to start with combining characters or to contain whitespace
> > > characters outside of the basic source character set is not
> > > advisable. That is, we should have a rule that validates
> > > pp-identifiers after determining the characters that each one
> > > encompasses.
> >
> > But if we don't make pp-identifier very accommodating,
> >
> > #define accent(x) x ## \u0300
> >
> > will never do anything useful, because \u0300 would not be a
> > preprocessing-token, so the preprocessing-token adjacent to ##
> > is just the backslash.
> >
> > It's hard to formulate, in the presence of lexer max-munch,
> > when we continue lexing vs. we stop because some non-grammatical
> > restriction is no longer satisfied. I was trying to avoid that
> > by making the transition pp-identifier -> identifier explicit.
> >
> > I understand the motivation, but I am having a hard time with
> > formalizing "invisible" preprocessing tokens that don't "fail fast".
> > These really are most interesting in conjunction with token pasting,
> > so I think we're looking at either making the above pasting ill-formed
> > (just because there is a \u0300 as a pp-identifier)
>
> My guess is that this would be fine, in your view:
>
> #define combine(X,Y) X ## Y
> #define stringize(X) # X
> char8_t * s = stringize(combine(A,\u0300));
>
> So, we're only looking at a special rule that the pp-identifier lexically
> following ## be an identifier, to avoid any source-code level "viewing in
> editor" confusion. But why is ## special in this regard, as opposed to
> (say) the comma in
>
> char8_t * s = stringize(combine(A,\u0300));
>
> where similar confusion might arise?
>
I'm not happy with the comma either.

>
> (This example should find its way into the paper as "ok", btw.)
>
> > or we are going to have that token pasting at the cost of needing to
> > deal with pasting characters that are "invisible" or appear to modify
> > the appearance of "##" itself.
>
> But that "dealing" is only a user-education issue, not a technical
> specification issue. Or am I missing something?
>
This is a design issue. This paper supposedly improves the situation around
Unicode in source code, but it clarifies certain cases as having the "bad
behaviour". That is, it "encourages" lexing that treats operators and
punctuators as unmodified by Unicode-related mechanisms.

>
> I'm inclined to not modify the wording in this area.
>
> >
> > > Did someone check that UAX #31 really is part of ISO 10646?
> > >
> > > It isn't, but what we need a cross reference to from the non-Annex
> > > wording is the Unicode Character Database, which is also not part of
> > > ISO/IEC 10646 (I can't check though, unicode.org--the one with the
> > > actual technical content--is down; they managed to keep the fluff
> > > site up...).
> >
> > It seems we don't really need a normative reference to UAX #31;
> > all we need is a normative reference to the database and a
> > bibliography entry for UAX #31.
> >
> > Agreed.
>
> www.unicode.org is still down, it seems. :-(
>
> Updated wording is attached.
>
> Jens
>

Received on 2020-04-21 21:41:23