Re: [SG16] Wording for UAX #31 identifiers

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 9 Apr 2020 19:39:57 -0400
On Thu, Apr 9, 2020 at 6:34 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 09/04/2020 23.01, Hubert Tong wrote:
> > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> >
> > See attached.
> >
> > We can only keep preprocessor tokens in the pp-identifier space for so
> > long. Any time scanning for object-like macro names or function-like
> > macro invocations occurs, we would need identifiers in NFC (barring
> > implementation shortcuts). The same applies to finding parameters in the
> > replacement list of a function-like macro. Shifting the place where
> > requirements are checked is sufficient to avoid token concatenation
> > UB/behavioural surprises.
> I have trouble relating these comments to my document.
> The idea of the pp-identifier introduction was to postpone
> the NFC check until we actually need an /identifier/.
> Are you saying that we need statements in [cpp.replace] p10 and p12
> and [cpp.subst] p1 clarifying that we transition from pp-identifier
> to identifier when identifying macro names and parameter names?

> That sounds reasonable.

> > Note that a universal-character-name never
> > represents a member of the basic source
> > character set, so we don't have to call out
> > underscores specifically.
> >
> > This makes any sequence involving a universal-character-name
> > a pp-identifier (and thus a preprocessing-token), so that
> >
> > #define accent(x) x ## \u0300
> >
> > does the right thing.
> >
> > I'm fine with pp-identifier as a lexing tool, but I think allowing them
> > to start with combining characters or to contain whitespace characters
> > outside of the basic source character set is not advisable. That is, we
> > should have a rule that validates pp-identifiers after determining the
> > characters that it encompasses.
> But if we don't make pp-identifier very accommodating,
> #define accent(x) x ## \u0300
> will never do anything useful, because \u0300 would not be a
> preprocessing-token, so the preprocessing-token adjacent to ##
> is just the backslash.
> It's hard to formulate, in the presence of lexer max-munch,
> when we continue lexing vs. we stop because some non-grammatical
> restriction is no longer satisfied. I was trying to avoid that
> by making the transition pp-identifier -> identifier explicit.
I understand the motivation, but I am having a hard time with formalizing
"invisible" preprocessing tokens that don't "fail fast". These really are
most interesting in conjunction with token pasting, so I think we're
looking at either making the above pasting ill-formed (just because there
is a \u0300 as a pp-identifier) or we are going to have that token pasting
at the cost of needing to deal with pasting characters that are "invisible"
or appear to modify the appearance of "##" itself.
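
To make the trade-off concrete, here is a minimal sketch of the "permissive
pp-identifier, validate later" model. The names PpIdentifier, paste and
is_valid_identifier are hypothetical, and the validity check is only a
stand-in: it rejects an empty token or one that starts with a character in
the Combining Diacritical Marks block (U+0300..U+036F), and does not
implement the real UAX #31 or NFC requirements.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: a pp-identifier is just a sequence of code points,
// accepted by the lexer without any identifier-level validation.
using PpIdentifier = std::vector<char32_t>;

// Stand-in for a Unicode Character Database lookup (assumption: only the
// Combining Diacritical Marks block, U+0300..U+036F, is recognised).
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Models the ## operator on two pp-identifiers: plain concatenation,
// with no validity check at this point.
PpIdentifier paste(const PpIdentifier& lhs, const PpIdentifier& rhs) {
    PpIdentifier out = lhs;
    out.insert(out.end(), rhs.begin(), rhs.end());
    return out;
}

// Models the pp-identifier -> identifier transition. The real rule would
// check UAX #31 and NFC; this sketch only rejects an empty token or a
// leading combining mark.
bool is_valid_identifier(const PpIdentifier& id) {
    return !id.empty() && !is_combining_mark(id.front());
}

// Models `#define accent(x) x ## \u0300` applied to one argument.
PpIdentifier accent(const PpIdentifier& x) {
    assert(!x.empty());  // the sketch assumes a non-empty argument
    return paste(x, PpIdentifier{char32_t{0x0300}});
}
```

Under this model a lone \u0300 is a pp-identifier that would fail the check
if it were ever needed as an identifier, while the result of pasting it onto
"e" passes. Validating every pp-identifier as soon as it is lexed would
instead reject the macro definition itself.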

> Note that all non-basic characters are represented as
> universal-character-names at that point (whether they
> appeared as such in the original source code or not).
> I think we should never break up \uXXXX during lexing or
> preprocessing.
I agree we should never break up \uXXXX or \UxxxxYYYY.
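
As an illustration of treating a universal-character-name as an indivisible
unit under max-munch, the following is a rough sketch, not proposed wording.
The function names ucn_length and scan_pp_identifier are hypothetical, and
the scanner is simplified: it only knows basic alphanumerics, underscore and
UCNs, and it sidesteps the pp-number distinction by refusing to start on a
digit.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Length of a complete universal-character-name at the start of s:
// 6 for \uXXXX, 10 for \UXXXXXXXX, or 0 if s does not start with one.
std::size_t ucn_length(std::string_view s) {
    if (s.size() < 2 || s[0] != '\\') return 0;
    std::size_t digits = (s[1] == 'u') ? 4 : (s[1] == 'U') ? 8 : 0;
    if (digits == 0 || s.size() < 2 + digits) return 0;
    for (std::size_t i = 0; i < digits; ++i) {
        if (!std::isxdigit(static_cast<unsigned char>(s[2 + i]))) return 0;
    }
    return 2 + digits;
}

// Greedy scan of a permissive pp-identifier at the start of s. Returns the
// number of characters consumed. A UCN is consumed whole or not at all, so
// the scan never stops in the middle of \uXXXX.
std::size_t scan_pp_identifier(std::string_view s) {
    // Simplification: a leading digit would start a pp-number instead.
    if (!s.empty() && std::isdigit(static_cast<unsigned char>(s[0]))) return 0;
    std::size_t pos = 0;
    while (pos < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[pos]);
        if (std::isalnum(c) || c == '_') {
            ++pos;
            continue;
        }
        std::size_t n = ucn_length(s.substr(pos));
        if (n == 0) break;  // not a complete UCN: stop before the backslash
        pos += n;
    }
    return pos;
}
```

With this shape, "\u0300" on its own is scanned as one six-character
token, whereas an incomplete sequence such as "\u03" consumes nothing and
leaves the backslash to be handled as a separate character.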

> > Did someone check that UAX #31 really is part of ISO 10646?
> >
> > It isn't, but what we need a cross reference to from the non-Annex
> > wording is the Unicode Character Database, which is also not part of
> > ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual
> > technical content--is down; they managed to keep the fluff site up...).
> It seems we don't really need a normative reference to UAX #31;
> all we need is a normative reference to the database and a
> bibliography entry for UAX #31.

Received on 2020-04-09 18:43:10