sg16: Re: [SG16] Wording for UAX #31 identifiers

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 9 Apr 2020 19:09:33 -0400

I agree that it looks like we don't need a normative reference to UAX #31,
as we are not deferring to it in that way. The Unicode Character Database
(UAX #44) does need one.

I think we want an undated reference plus a requirement that implementations
document which version of the Unicode Standard they are using. We have a few
proposals in flight that would require this anyway.
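
As a rough illustration of what "document which version" buys the programmer,
here is a minimal Python sketch. It relies on Python's own identifier rule,
which is based on XID_Start/XID_Continue but is not the proposed C++ wording,
and on the Unicode version the interpreter happens to ship. The helper name
identifier_report is made up for this example.

```python
import unicodedata


def identifier_report(spelling: str) -> str:
    """Say whether `spelling` passes Python's XID-based identifier check,
    and name the Unicode version the answer was computed against."""
    verdict = "accepted" if spelling.isidentifier() else "rejected"
    return f"{spelling!r}: {verdict} (Unicode {unicodedata.unidata_version})"


if __name__ == "__main__":
    # U+0300 COMBINING GRAVE ACCENT may continue an identifier but not start one.
    for spelling in ["caf\u00e9", "x\u0300", "\u0300x"]:
        print(identifier_report(spelling))
```

The verdict for a character added in a later Unicode version can differ
between two interpreters, which is why the version is printed alongside it.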

I think the portability issue will not be terrible in practice. Very few
living languages are still having characters added that would be allowed in
identifiers.

On Thu, Apr 9, 2020, 18:34 Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 09/04/2020 23.01, Hubert Tong wrote:
> > On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> >
> > See attached.
> >
> > We can only keep preprocessor tokens in the pp-identifier space for so
> > long. Any time scanning for object-like macro names or function-like
> > macro invocations occurs, we would need identifiers in NFC (barring
> > implementation shortcuts). The same applies to finding parameters in the
> > replacement list of a function-like macro. Shifting the place where
> > requirements are checked is sufficient to avoid token concatenation
> > UB/behavioural surprises.
>
> I have trouble relating these comments to my document.
> The idea of the pp-identifier introduction was to postpone
> the NFC check until we actually need an /identifier/.
>
> Are you saying that we need statements in [cpp.replace] p10 and p12
> and [cpp.subst] p1 clarifying that we transition from pp-identifier
> to identifier when identifying macro names and parameter names?
> That sounds reasonable.
>
> > Note that a universal-character-name never
> > represents a member of the basic source
> > character set, so we don't have to call out
> > underscores specifically.
> >
> > This makes any sequence involving a universal-character-name
> > a pp-identifier (and thus a preprocessing-token), so that
> >
> > #define accent(x) x ## \u0300
> >
> > does the right thing.
> >
> > I'm fine with pp-identifier as a lexing tool, but I think allowing them
> > to start with combining characters or to contain whitespace characters
> > outside of the basic source character set is not advisable. That is, we
> > should have a rule that validates a pp-identifier after determining the
> > characters that it encompasses.
>
> But if we don't make pp-identifier very accommodating,
>
> #define accent(x) x ## \u0300
>
> will never do anything useful, because \u0300 would not be a
> preprocessing-token, so the preprocessing-token adjacent to ##
> is just the backslash.
>
> It's hard to formulate, in the presence of lexer max-munch,
> when we continue lexing vs. we stop because some non-grammatical
> restriction is no longer satisfied. I was trying to avoid that
> by making the transition pp-identifier -> identifier explicit.
>
> Note that all non-basic characters are represented as
> universal-character-names at that point (whether they
> appeared as such in the original source code or not).
> I think we should never break up \uXXXX during lexing or
> preprocessing.
>
> > Did someone check that UAX #31 really is part of ISO 10646?
> >
> > It isn't, but what we need a cross reference to from the non-Annex
> > wording is the Unicode Character Database, which is also not part of
> > ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual
> > technical content--is down; they managed to keep the fluff site up...).
>
> It seems we don't really need a normative reference to UAX #31;
> all we need is a normative reference to the database and a
> bibliography entry for UAX #31.
>
> > There should be a cross-reference to the Annex somewhere in [lex.name].
> >
> > Further concerns:
> >
> > We have a generic reference to ISO 10646 in the front matter
> > of the standard. That means the most recent version applies,
> > implicitly. That's a bit of a moving target, though: Does
> > an implementation lose conformance if a new version of ISO 10646
> > is issued (because more characters are allowed in identifiers in
> > later versions, maybe)?
> >
> > Should we maybe require an implementation to document which
> > revision of ISO 10646 was used for XID_Start and XID_Continue?
> >
> > Sure (although not of ISO/IEC 10646).
> >
> >
> > This way, programmers can at least find out about a
> > portability pitfall.
> >
> >
> > The paper should spend a section on explaining how expensive
> > (code size; maybe performance) a check for NFC is for the compiler.
> > Does the compiler need the entire Unicode tables, or are there
> > shortcuts (e.g. a few ranges of "bad" code points)?
> >
> > Jens
> > --
> > SG16 mailing list
> > SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> >
>
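
For readers following the accent(x) example quoted above, the sketch below
models token pasting as plain string concatenation and runs the NFC check only
on the result, which is the "postpone the check until we need an identifier"
idea. It is an illustration in Python, not compiler code: paste and is_nfc are
names invented here, and unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata


def paste(left: str, right: str) -> str:
    """Model `left ## right` as concatenation of the two spellings."""
    return left + right


def is_nfc(spelling: str) -> bool:
    """True if `spelling` is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", spelling)


if __name__ == "__main__":
    for base in ["x", "e"]:
        pasted = paste(base, "\u0300")  # like accent(x) and accent(e)
        print(ascii(pasted), "NFC" if is_nfc(pasted) else "not NFC")
```

Pasting onto "x" stays in NFC because no precomposed character exists for that
pair, while pasting onto "e" does not, since NFC would compose it to U+00E8.
On the cost question, UAX #15 defines a quick-check property (NFC_Quick_Check)
that lets many strings be accepted without running full normalization; whether
that is cheap enough for a compiler is not something this sketch measures.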

Received on 2020-04-09 18:12:38