Re: [SG16] Wording for UAX #31 identifiers

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 9 Apr 2020 17:01:46 -0400
On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

>
> See attached.
>
We can only keep preprocessor tokens in the pp-identifier space for so
long. Any time scanning for object-like macro names or function-like macro
invocations occurs, we would need identifiers in NFC (barring implementation
shortcuts). The same applies to finding parameters in the replacement list
of a function-like macro. Shifting the place where requirements are checked
is sufficient to avoid token concatenation UB/behavioural surprises.
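
To illustrate why macro-name scanning wants NFC, here is a minimal sketch. It assumes a hypothetical macro table keyed by the raw code point sequence of the name; MacroTable, is_defined and the two helper functions are made-up names, not anything from the draft wording or from a real implementation. The precomposed spelling U+00E9 and the decomposed spelling U+0065 U+0301 are canonically equivalent, but they do not compare equal code point by code point, so a lookup that does not normalize first misses the macro.

```cpp
#include <map>
#include <string>

// Hypothetical macro table: macro name (as code points) -> replacement list.
using MacroTable = std::map<std::u32string, std::u32string>;

// Looks up `name` by plain code point comparison.
// No normalization is applied here; that is the point of the sketch.
inline bool is_defined(const MacroTable& table, const std::u32string& name) {
    return table.find(name) != table.end();
}

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: the NFC spelling.
inline std::u32string precomposed_e_acute() {
    return {char32_t{0x00E9}};
}

// U+0065 U+0301: canonically equivalent to U+00E9, but not in NFC.
// This is the kind of spelling that pasting `e` and `\u0301` could produce.
inline std::u32string decomposed_e_acute() {
    return {char32_t{0x0065}, char32_t{0x0301}};
}
```

With a macro defined under the precomposed spelling, is_defined finds it under that spelling and not under the decomposed one, unless the identifier has been brought to NFC (or rejected) before the scan.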


>
> Note that a universal-character-name never
> represents a member of the basic source
> character set, so we don't have to call out
> underscores specifically.
>
> This makes any sequence involving a universal-character-name
> a pp-identifier (and thus a preprocessing-token), so that
>
> #define accent(x) x ## \u0300
>
> does the right thing.
>
I'm fine with pp-identifiers as a lexing tool, but I think allowing them to
start with combining characters or to contain whitespace characters outside
of the basic source character set is not advisable. That is, we should have
a rule that validates a pp-identifier after determining the characters that
it encompasses.
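
A rough sketch of what such a validation step might look like, applied after the lexer has already decided which characters belong to the pp-identifier. The function names are hypothetical, and the "combining character" test is deliberately reduced to the Combining Diacritical Marks block (U+0300..U+036F) as a stand-in; a real check would consult the Unicode Character Database. The whitespace test lists the non-ASCII code points that have the Unicode White_Space property.

```cpp
#include <string>

// Assumption: only U+0300..U+036F (Combining Diacritical Marks) is treated
// as "combining" in this sketch; a real check would use UCD properties.
inline bool is_combining_sketch(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Non-ASCII code points with the Unicode White_Space property.
inline bool is_non_basic_whitespace(char32_t c) {
    return c == 0x0085 || c == 0x00A0 || c == 0x1680 ||
           (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F ||
           c == 0x205F || c == 0x3000;
}

// Hypothetical post-lexing validation of a pp-identifier:
// reject an empty sequence, a leading combining character,
// and any whitespace character outside the basic source character set.
inline bool validate_pp_identifier(const std::u32string& s) {
    if (s.empty() || is_combining_sketch(s.front())) {
        return false;
    }
    for (char32_t c : s) {
        if (is_non_basic_whitespace(c)) {
            return false;
        }
    }
    return true;
}
```

Under this sketch, "a" followed by U+0301 is accepted, a lone U+0300 is rejected, and "a", U+00A0, "b" is rejected. Where in the translation phases such a check would run is left open here.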


>
> Did someone check that UAX #31 really is part of ISO 10646?
>
It isn't, but what the non-Annex wording needs a cross-reference to is the
Unicode Character Database, which is also not part of ISO/IEC 10646
(I can't check, though: unicode.org, the one with the actual technical
content, is down; they managed to keep the fluff site up...).


> There should be a cross-reference to the Annex somewhere in [lex.name].
>
> Further concerns:
>
> We have a generic reference to ISO 10646 in the front matter
> of the standard. That means the most recent version applies,
> implicitly. That's a bit of a moving target, though: Does
> an implementation lose conformance if a new version of ISO 10646
> is issued (because more characters are allowed in identifiers in
> later versions, maybe)?
>
> Should we maybe require an implementation to document which
> revision of ISO 10646 was used for XID_Start and XID_Continue?
>
Sure (although not of ISO/IEC 10646).


> This way, programmers can at least find out about a
> portability pitfall.
>
>
> The paper should spend a section on explaining how expensive
> (code size; maybe performance) a check for NFC is for the compiler.
> Does the compiler need the entire Unicode tables, or are there
> shortcuts (e.g. a few ranges of "bad" code points)?
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-04-09 16:04:59