On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

See attached.
We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we need identifiers in NFC (barring implementation shortcuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where the requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.
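
To make the concatenation concern concrete, here is a small Python sketch. Python is used only because its standard unicodedata module ships normalization. The names is_nfc and paste are made up for illustration, and modelling ## as plain string concatenation is an assumption, not the proposed wording. It shows two spellings that are each in NFC pasting into one that is not:

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def paste(left: str, right: str) -> str:
    """Hypothetical stand-in for ##: plain concatenation of two spellings."""
    return left + right


left, right = "e", "\u0301"  # LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
pasted = paste(left, right)

print(is_nfc(left), is_nfc(right))  # each operand on its own is in NFC
print(is_nfc(pasted))               # the pasted spelling is not
print(unicodedata.normalize("NFC", pasted) == "\u00e9")  # it composes to U+00E9
```

So a check done only on the operands says nothing about the result, which is why the point at which the requirement is checked matters.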
 

Note that a universal-character-name never
represents a member of the basic source
character set, so we don't have to call out
underscores specifically.

This makes any sequence involving a universal-character-name
a pp-identifier (and thus a preprocessing-token), so that

#define accent(x) x ## \u0300

does the right thing.
I'm fine with pp-identifier as a lexing tool, but I think allowing pp-identifiers to start with combining characters, or to contain whitespace characters from outside the basic source character set, is not advisable. That is, we should have a rule that validates each pp-identifier after determining the characters that it encompasses.
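
A rough sketch of what such a post-lexing validation rule could look like, in Python. Everything here is an assumption for illustration: the function names are hypothetical, "combining character" is approximated by General_Category M*, and the basic source character set is approximated by ASCII.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # General_Category Mn, Mc and Me cover the combining marks.
    return unicodedata.category(ch).startswith("M")


def is_non_basic_whitespace(ch: str) -> bool:
    # Whitespace outside ASCII, e.g. U+00A0 NO-BREAK SPACE or U+2003 EM SPACE.
    return ord(ch) > 0x7F and ch.isspace()


def validate_pp_identifier(spelling: str) -> bool:
    """Validate a spelling after the lexer has decided which characters it spans."""
    if not spelling:
        return False
    if is_combining(spelling[0]):
        return False
    return not any(is_non_basic_whitespace(ch) for ch in spelling)


for spelling in ("caf\u00e9", "a\u0300", "\u0300x", "a\u00a0b"):
    print(repr(spelling), validate_pp_identifier(spelling))
```

The lexer stays permissive about what it groups together, and the rejection happens in one separate step afterwards.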
 

Did someone check that UAX #31 really is part of ISO 10646?
It isn't. What the non-Annex wording needs a cross-reference to, though, is the Unicode Character Database, which is also not part of ISO/IEC 10646. (I can't check right now: unicode.org, the one with the actual technical content, is down; they managed to keep the fluff site up...)
 
There should be a cross-reference to the Annex somewhere in [lex.name].

Further concerns:

We have a generic reference to ISO 10646 in the front matter
of the standard. That means the most recent version applies,
implicitly.  That's a bit of a moving target, though: Does
an implementation lose conformance if a new version of ISO 10646
is issued (because more characters are allowed in identifiers in
later versions, maybe)?

Should we maybe require an implementation to document which
revision of ISO 10646 was used for XID_Start and XID_Continue?
Sure (although not of ISO/IEC 10646).
 
This way, programmers can at least find out about a
portability pitfall.
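
For comparison, this is roughly how one existing implementation exposes that information. Python bases its own identifier rules on XID_Start and XID_Continue (plus the underscore) and reports the Unicode version of its tables as unicodedata.unidata_version. The helper name describe_identifier_tables is made up for this sketch, and nothing here is proposed wording.

```python
import unicodedata


def describe_identifier_tables() -> str:
    """Report which Unicode version the identifier tables were generated from."""
    return f"identifier tables: Unicode {unicodedata.unidata_version}"


print(describe_identifier_tables())

# str.isidentifier() applies Python's XID_Start / XID_Continue based rules.
for spelling in ("caf\u00e9", "a\u0300", "\u0300a"):
    print(repr(spelling), spelling.isidentifier())
```

A spelling that uses a character added in a later Unicode version would be accepted or rejected depending on that reported version, which is the portability pitfall in question.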


The paper should spend a section on explaining how expensive
(code size; maybe performance) a check for NFC is for the compiler.
Does the compiler need the entire Unicode tables, or are there
shortcuts (e.g. a few ranges of "bad" code points)?
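
One possible shape for such a shortcut, as a Python sketch. It assumes that every code point below U+0300 has NFC_Quick_Check=Yes and canonical combining class 0, so a spelling made only of such code points is already in NFC; that assumption should be checked against the Unicode Character Database version in use. The constant and function names are hypothetical, and the slow path here simply defers to the full normalizer rather than modelling a compiler's tables.

```python
import unicodedata

# Assumed fast-path bound: below U+0300 nothing composes or reorders under NFC.
NFC_FAST_PATH_LIMIT = 0x0300


def is_nfc(spelling: str) -> bool:
    """Check for NFC, skipping the table-driven path for 'boring' spellings."""
    if all(ord(ch) < NFC_FAST_PATH_LIMIT for ch in spelling):
        return True  # fast path: no tables consulted
    # Slow path: full normalization, which does need the Unicode data.
    return unicodedata.normalize("NFC", spelling) == spelling


for spelling in ("count", "caf\u00e9", "cafe\u0301", "\u212b"):
    print(repr(spelling), is_nfc(spelling))
```

With a bound like this, identifiers written in ASCII or Latin-1 never touch the normalization data; how large the remaining tables have to be is the part the paper would still need to quantify.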

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16