Re: [SG16] Wording for UAX #31 identifiers

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 10 Apr 2020 09:29:54 +0200
On 10/04/2020 01.39, Hubert Tong wrote:
> On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 09/04/2020 23.01, Hubert Tong wrote:
> > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> >
> > See attached.
> >
> > We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we would need identifiers in NFC (barring implementation short cuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.
> I have trouble relating these comments to my document.
> The idea of the pp-identifier introduction was to postpone
> the NFC check until we actually need an /identifier/.
> Are you saying that we need statements in [cpp.replace] p10 and p12
> and [cpp.subst] p1 clarifying that we transition from pp-identifier
> to identifier when identifying macro names and parameter names?
> Yes.

I've thought a little more about that.

There is clearly an /identifier/ in the preprocessor grammar when
defining these things, so we're good in that regard.

When we mention a macro or parameter in the program text and its
spelling is not the same as the /identifier/ in the definition
(for example, because it is not in NFC), we simply don't replace it.
An attempt is then made to turn it into an /identifier/ in phase 7,
which fails.
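
A minimal sketch of the "no match, no replacement" behaviour described above, assuming macro names are compared by their exact code-point spelling. The names MacroTable, lookup_macro, demo_table, nfc_mention and nfd_mention are hypothetical and serve only as illustration; this is not how any particular implementation stores macros.

```cpp
#include <map>
#include <string>

// Hypothetical macro table: maps the exact code-point spelling of a
// macro name to its replacement text. No normalization is applied.
using MacroTable = std::map<std::u32string, std::string>;

// Returns the replacement text if the spelling matches a defined macro
// exactly, or nullptr if it does not (in which case nothing is replaced).
const std::string* lookup_macro(const MacroTable& table,
                                const std::u32string& spelling) {
    auto it = table.find(spelling);
    return it == table.end() ? nullptr : &it->second;
}

// The macro is defined with an NFC spelling (U+00E9, precomposed e-acute).
const MacroTable demo_table = {{U"caf\u00E9", "1"}};

// Same spelling as the definition: this mention matches.
const std::u32string nfc_mention = U"caf\u00E9";

// Non-NFC spelling (e followed by U+0301, combining acute): no match,
// so the pp-identifier is left alone and fails later in phase 7.
const std::u32string nfd_mention = U"cafe\u0301";
```

Under this assumption, the non-NFC mention is never confused with the macro; it just passes through preprocessing untouched.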

Since we need to trial-match every pp-identifier we encounter
against defined macro names and macro parameters, we'd otherwise
do the transition during preprocessing all the time, which
is undesirable for token-pasting:

#define combine(X,Y) X ## Y
#define stringize(X) # X
char8_t * s = stringize(combine(A,\u0300));

When we rescan the intermediate macro-replacement result
for more macro replacement, A\u0300 shouldn't be ill-formed
right there.
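
A minimal sketch of how the two macros above interact, using ASCII stand-ins (A and B) instead of \u0300 so that it compiles with current compilers. One point worth keeping in mind: the operand of # is not macro-expanded before it is stringized, so a helper is needed to observe the pasted token. The helper expand_then_stringize and the variables direct_text and pasted_text are hypothetical names added for illustration only.

```cpp
#include <cassert>
#include <cstring>

#define combine(X,Y) X ## Y
#define stringize(X) # X
// Hypothetical helper (not in the original mail): its argument is fully
// macro-expanded before being handed to stringize.
#define expand_then_stringize(X) stringize(X)

// The operand of # is taken as spelled, so combine(...) is not expanded.
const char* const direct_text = stringize(combine(A,B));

// Here combine(A,B) is expanded first, pasting A and B into the single
// token AB, which is then stringized.
const char* const pasted_text = expand_then_stringize(combine(A,B));
```

With the helper, the pasted result goes through a rescan before it is stringized, which is the point at which a premature pp-identifier to identifier transition would matter.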

Notes added.

> That sounds reasonable.
> > Note that a universal-character-name never
> > represents a member of the basic source
> > character set, so we don't have to call out
> > underscores specifically.
> >
> > This makes any sequence involving a universal-character-name
> > a pp-identifier (and thus a preprocessing-token), so that
> >
> > #define accent(x) x ## \u0300
> >
> > does the right thing.
> >
> > I'm fine with pp-identifier as a lexing tool, but I think allowing pp-identifiers to start with combining characters or to contain whitespace characters outside of the basic source character set is not advisable. That is, we should have a rule that validates a pp-identifier after determining the characters that it encompasses.
> But if we don't make pp-identifier very accommodating,
> #define accent(x) x ## \u0300
> will never do anything useful, because \u0300 would not be a
> preprocessing-token, so the preprocessing-token adjacent to ##
> is just the backslash.
> It's hard to formulate, in the presence of lexer max-munch,
> when we continue lexing vs. we stop because some non-grammatical
> restriction is no longer satisfied. I was trying to avoid that
> by making the transition pp-identifier -> identifier explicit.
> I understand the motivation, but I am having a hard time with formalizing "invisible" preprocessing tokens that don't "fail fast". These really are most interesting in conjunction with token pasting, so I think we're looking at either making the above pasting ill-formed (just because there is a \u0300 as a pp-identifier)

My guess is that this would be fine, in your view:

#define combine(X,Y) X ## Y
#define stringize(X) # X
char8_t * s = stringize(combine(A,\u0300));

So, we're only looking at a special rule that the pp-identifier lexically
following ## be an identifier, to avoid any source-code level "viewing in
editor" confusion. But why is ## special in this regard, as opposed to
(say) the comma in

char8_t * s = stringize(combine(A,\u0300));

where similar confusion might arise?

(This example should find its way into the paper as "ok", btw.)

> or we are going to have that token pasting at the cost of needing to deal with pasting characters that are "invisible" or appear to modify the appearance "##" itself.

But that "dealing" is only a user-education issue, not a technical
specification issue. Or am I missing something?

I'm inclined to not modify the wording in this area.

> > Did someone check that UAX #31 really is part of ISO 10646?
> >
> > It isn't, but what we need a cross reference to from the non-Annex wording is the Unicode Character Database, which is also not part of ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual technical content--is down; they managed to keep the fluff site up...).
> It seems we don't really need a normative reference to UAX #31;
> all we need is a normative reference to the database and a
> bibliography entry for UAX #31.
> Agreed.

www.unicode.org is still down, it seems. :-(

Updated wording is attached.


Received on 2020-04-10 02:32:52