Date: Fri, 10 Apr 2020 00:34:26 +0200
On 09/04/2020 23.01, Hubert Tong wrote:
> On Thu, Apr 9, 2020 at 2:15 AM Jens Maurer via SG16 <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
>
> See attached.
>
> We can only keep preprocessor tokens in the pp-identifier space for so long. Any time scanning for object-like macro names or function-like macro invocations occurs, we would need identifiers in NFC (barring implementation shortcuts). The same applies to finding parameters in the replacement list of a function-like macro. Shifting the place where requirements are checked is sufficient to avoid token concatenation UB/behavioural surprises.
I have trouble relating these comments to my document.
The idea of the pp-identifier introduction was to postpone
the NFC check until we actually need an /identifier/.
Are you saying that we need statements in [cpp.replace] p10 and p12
and [cpp.subst] p1 clarifying that we transition from pp-identifier
to identifier when identifying macro names and parameter names?
That sounds reasonable.
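To illustrate the kind of mismatch that motivates requiring NFC at macro-name
lookup time, here is a sketch outside the preprocessor (Python purely as a
stand-in for the lexer's name-comparison step; the identifiers are invented):

```python
import unicodedata

# The same visible identifier, spelled two ways:
composed   = "caf\u00e9"    # "café" with precomposed U+00E9
decomposed = "cafe\u0301"   # "café" with e + combining acute U+0301

# A naive macro-name lookup compares raw code-point sequences,
# so these two spellings would not match:
assert composed != decomposed

# After NFC normalization they denote the same identifier:
assert unicodedata.normalize("NFC", decomposed) == composed
```

That is, without an NFC requirement at the pp-identifier -> identifier
transition, whether a macro invocation is recognized could depend on how the
source file happened to encode the accent.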
> Note that a universal-character-name never
> represents a member of the basic source
> character set, so we don't have to call out
> underscores specifically.
>
> This makes any sequence involving a universal-character-name
> a pp-identifier (and thus a preprocessing-token), so that
>
> #define accent(x) x ## \u0300
>
> does the right thing.
>
> I'm fine with pp-identifier as a lexing tool, but I think allowing them to start with combining characters or to contain whitespace characters outside of the basic source character set is not advisable. That is, we should have a rule that validates pp-identifiers after determining the characters they encompass.
But if we don't make pp-identifier very accommodating,
#define accent(x) x ## \u0300
will never do anything useful, because \u0300 would not be a
preprocessing-token, so the preprocessing-token adjacent to ##
is just the backslash.
It's hard to formulate, in the presence of lexer max-munch,
when we continue lexing vs. when we stop because some
non-grammatical restriction is no longer satisfied. I was trying
to avoid that by making the transition pp-identifier -> identifier explicit.
Note that all non-basic characters are represented as
universal-character-names at that point (whether they
appeared as such in the original source code or not).
I think we should never break up \uXXXX during lexing or
preprocessing.
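The accent example can be mimicked with ordinary strings (again Python as a
stand-in; in the real preprocessor this happens across translation phases):
pasting x and \u0300 first and only then normalizing yields the precomposed
character, which is why the NFC/identifier check has to come after ## has
done its work:

```python
import unicodedata

def accent(x: str) -> str:
    # Models "#define accent(x) x ## \u0300": paste first, check later.
    return x + "\u0300"

pasted = accent("a")         # "a" followed by combining grave U+0300
assert len(pasted) == 2      # still two code points right after pasting
# NFC turns the pair into the single precomposed character U+00E0 ("à"):
assert unicodedata.normalize("NFC", pasted) == "\u00e0"
```

If \u0300 were rejected before concatenation, the paste could never produce
this identifier at all.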
> Did someone check that UAX #31 really is part of ISO 10646?
>
> It isn't, but what the non-Annex wording needs a cross-reference to is the Unicode Character Database, which is also not part of ISO/IEC 10646 (I can't check though, unicode.org--the one with the actual technical content--is down; they managed to keep the fluff site up...).
It seems we don't really need a normative reference to UAX #31;
all we need is a normative reference to the database and a
bibliography entry for UAX #31.
> There should be a cross-reference to the Annex somewhere in [lex.name <http://lex.name>].
>
> Further concerns:
>
> We have a generic reference to ISO 10646 in the front matter
> of the standard. That means the most recent version applies,
> implicitly. That's a bit of a moving target, though: Does
> an implementation lose conformance if a new version of ISO 10646
> is issued (because more characters are allowed in identifiers in
> later versions, maybe)?
>
> Should we maybe require an implementation to document which
> revision of ISO 10646 was used for XID_Start and XID_Continue?
>
> Sure (although not of ISO/IEC 10646).
>
>
> This way, programmers can at least find out about a
> portability pitfall.
>
>
> The paper should spend a section on explaining how expensive
> (code size; maybe performance) a check for NFC is for the compiler.
> Does the compiler need the entire Unicode tables, or are there
> shortcuts (e.g. a few ranges of "bad" code points)?
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
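On the cost question above: an implementation should not need the full
normalization tables just to reject non-NFC identifiers. Unicode's
NFC_Quick_Check property marks the comparatively small set of code points
that can make a string non-NFC, so a membership test over those ranges
suffices for most input. A hedged sketch (Python's unicodedata exposes a
quick-check-based test directly; the "ranges of bad code points" shortcut is
shown as an ASCII fast path):

```python
import unicodedata

# Python 3.8+ exposes a quick-check-based NFC test directly:
assert unicodedata.is_normalized("NFC", "caf\u00e9")        # precomposed: NFC
assert not unicodedata.is_normalized("NFC", "cafe\u0301")   # decomposed: not NFC

# Shortcut idea: basic-source-character identifiers can never break NFC,
# so a compiler can skip the tables entirely for the common case.
def nfc_fast_path(ident: str) -> bool:
    if ident.isascii():
        return True                  # no table lookup needed at all
    return unicodedata.is_normalized("NFC", ident)

assert nfc_fast_path("plain_ascii_name")
assert not nfc_fast_path("cafe\u0301")
```

So the table footprint is bounded by the quick-check data rather than the
full decomposition mappings, and the common all-ASCII identifier costs
nothing.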
Received on 2020-04-09 17:37:23