sg16: Re: [SG16] Wording for UAX #31 identifiers

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 14 Apr 2020 21:56:50 +0200

On 14/04/2020 21.41, Steve Downey wrote:
> Jens, this is a fairly significant piece of work. Thanks a lot! Would you wish to be credited as an author?

Sure, why not, thanks.

Jens

>
> On Fri, Apr 10, 2020 at 3:30 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> On 10/04/2020 01.39, Hubert Tong wrote:
> > On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 09/04/2020 23.01, Hubert Tong wrote:
> > > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> > >
> > >
> > > See attached.
> > >
> > > We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we would need identifiers in NFC (barring implementation shortcuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.
> >
> > I have trouble relating these comments to my document.
> > The idea of the pp-identifier introduction was to postpone
> > the NFC check until we actually need an /identifier/.
> >
> > Are you saying that we need statements in [cpp.replace] p10 and p12
> > and [cpp.subst] p1 clarifying that we transition from pp-identifier
> > to identifier when identifying macro names and parameter names?
> >
> > Yes.
>
> I've thought a little more about that.
>
> There is clearly an /identifier/ in the preprocessor grammar when
> defining these things, so we're good in that regard.
>
> When we mention a macro or parameter in the program text and its
> spelling is not the same as the /identifier/ in the definition
> (for example, because of non-NFC), we simply don't replace,
> and an attempt is made to turn it into an /identifier/
> in phase 7, which then fails.
>
> Since we need to trial-match every pp-identifier we encounter
> against defined macro names and macro parameters, we'd otherwise
> do the transition during preprocessing all the time, which
> is undesirable for token-pasting:
>
>   #define combine(X,Y) X ## Y
>   #define stringize(X) # X
>   char8_t * s = stringize(combine(A,\u0300));
>
> When we rescan the intermediate macro-replacement result
>
>   stringize(A\u0300)
>
> for more macro replacement, A\u0300 shouldn't be ill-formed
> right there.
>
> Notes added.
>
> > That sounds reasonable.
> >
> > > Note that a universal-character-name never
> > > represents a member of the basic source
> > > character set, so we don't have to call out
> > > underscores specifically.
> > >
> > > This makes any sequence involving a universal-character-name
> > > a pp-identifier (and thus a preprocessing-token), so that
> > >
> > >   #define accent(x) x ## \u0300
> > >
> > > does the right thing.
> > >
> > > I'm fine with pp-identifier as a lexing tool, but I think allowing one to start with a combining character or to contain whitespace characters outside of the basic source character set is not advisable. That is, we should have a rule that validates a pp-identifier after determining the characters that it encompasses.
> >
> > But if we don't make pp-identifier very accommodating,
> >
> >   #define accent(x) x ## \u0300
> >
> > will never do anything useful, because \u0300 would not be a
> > preprocessing-token, so the preprocessing-token adjacent to ##
> > is just the backslash.
> >
> > It's hard to formulate, in the presence of lexer max-munch,
> > when we continue lexing vs. we stop because some non-grammatical
> > restriction is no longer satisfied. I was trying to avoid that
> > by making the transition pp-identifier -> identifier explicit.
> >
> > I understand the motivation, but I am having a hard time with formalizing "invisible" preprocessing tokens that don't "fail fast". These really are most interesting in conjunction with token pasting, so I think we're looking at either making the above pasting ill-formed (just because there is a \u0300 as a pp-identifier)
>
> My guess is that this would be fine, in your view:
>
>   #define combine(X,Y) X ## Y
>   #define stringize(X) # X
>   char8_t * s = stringize(combine(A,\u0300));
>
> So, we're only looking at a special rule that the pp-identifier lexically
> following ## be an identifier, to avoid any source-code level "viewing in
> editor" confusion. But why is ## special in this regard, as opposed to
> (say) the comma in
>
>   char8_t * s = stringize(combine(A,\u0300));
>
> where similar confusion might arise?
>
> (This example should find its way into the paper as "ok", btw.)
>
> > or we are going to have that token pasting at the cost of needing to deal with pasting characters that are "invisible" or appear to modify the appearance "##" itself.
>
> But that "dealing" is only a user-education issue, not a technical
> specification issue. Or am I missing something?
>
> I'm inclined to not modify the wording in this area.
>
> >
> > > Did someone check that UAX #31 really is part of ISO 10646?
> > >
> > > It isn't, but what we need a cross reference to from the non-Annex wording is the Unicode Character Database, which is also not part of ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual technical content--is down; they managed to keep the fluff site up...).
> >
> > It seems we don't really need a normative reference to UAX #31;
> > all we need is a normative reference to the database and a
> > bibliography entry for UAX #31.
> >
> > Agreed.
>
> www.unicode.org is still down, it seems. :-(
>
> Updated wording is attached.
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
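
As a footnote to the combine/stringize example quoted above, here is a minimal sketch of how # and ## interact in a conforming preprocessor. It deliberately uses an ordinary pp-number (123) in place of \u0300, because how a lone \u0300 is lexed is exactly what this thread is debating and currently differs between implementations. The helper macro stringize_expanded and the function names direct and expanded are made up for illustration. What it is meant to show: the right-hand operand of ## need not be an identifier on its own, as long as the pasted result is a valid token, and an operand of # is not macro-expanded unless it goes through one more level of macro.

```cpp
#include <string>

#define combine(X,Y) X ## Y
#define stringize(X) # X
// Hypothetical helper: the extra level lets the argument be macro-expanded
// before it reaches the # operator.
#define stringize_expanded(X) stringize(X)

// An operand of # is not macro-expanded, so the call is stringized as written.
inline std::string direct() { return stringize(combine(foo,123)); }

// Here combine(foo,123) is expanded first: foo ## 123 pastes to foo123,
// even though 123 by itself is a pp-number, not an identifier.
inline std::string expanded() { return stringize_expanded(combine(foo,123)); }
```

With these definitions, direct() yields the text of the unexpanded call, while expanded() yields the pasted spelling.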

Received on 2020-04-14 14:59:48
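
The "accommodating pp-identifier, check later" approach discussed in the thread can be sketched roughly as below. This is only an illustration under stated assumptions, not the proposed wording: the function names are hypothetical, only the four-digit UCN form is recognized, and the deferred check looks solely for a leading combining mark in U+0300..U+036F, whereas a real check would consult the Unicode Character Database properties and require NFC.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if s has a backslash, 'u', and four hex digits starting at index i.
inline bool is_ucn_at(const std::string& s, std::size_t i) {
    if (i + 6 > s.size() || s[i] != '\\' || s[i + 1] != 'u') return false;
    for (std::size_t k = 2; k < 6; ++k)
        if (!std::isxdigit(static_cast<unsigned char>(s[i + k]))) return false;
    return true;
}

// Max-munch lexing: length of the longest pp-identifier-like run at pos.
// Any UCN is accepted here; nothing is validated at this stage.
inline std::size_t lex_pp_identifier(const std::string& s, std::size_t pos) {
    if (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        return 0;  // a leading digit would start a pp-number instead
    std::size_t i = pos;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isalnum(c) || c == '_') ++i;
        else if (is_ucn_at(s, i)) i += 6;
        else break;
    }
    return i - pos;
}

// Deferred check, applied only when an identifier is actually needed.
// Assumption: a stand-in that rejects just a leading combining mark.
inline bool starts_with_combining_mark(const std::string& tok) {
    if (!is_ucn_at(tok, 0)) return false;
    unsigned long cp = std::stoul(tok.substr(2, 4), nullptr, 16);
    return cp >= 0x0300 && cp <= 0x036F;
}
```

In this sketch the lexer treats the UCN after ## in the accent example as a single six-character token, so pasting has something to work with, and the complaint about a leading combining mark is raised only if that token is later required to be an identifier.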