Re: Help request: regex for pp-number with XID_Continue

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Dec 2022 11:50:34 -0500
On 12/12/22 5:15 PM, will wray via SG16 wrote:
> Fixing an issue in the pcpp pure Python C preprocessor
> I'm out of my depth in unicode regex (and preprocessing)
> I'm proposing to add PP_NUMBER using the regex below to accept
> pp-number with identifier-continue = digit + non-digit
> '\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*'
> https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
> then the linked parsing spec for identifier-continue includes
> > 'an element of the translation character set of class XID_Continue'
> I guess the regex should be extended to parse XID_Continue.
> Does anyone here have a clue how to do that?
> (With Python re as the regex engine.)

A regex engine that supports UTS#18 (Unicode Regular Expressions)
<https://unicode.org/reports/tr18/> rule RL2.7
<https://unicode.org/reports/tr18/#RL2.7> would allow matching a
character with the XID_Continue property with the syntax
\p{XID_Continue}; see the Unicode Utilities example.
Note that the example matches the identifier (but neither the
mathematical character nor the clown faces).

I don't think the Python regex engine supports that though, so you'll
have to find another regex engine or another method to perform the match.
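
As one possible "another method", here is a minimal sketch of a
hand-written pp-number scanner that uses only the Python standard
library. It relies on the fact that Python identifiers are defined as
XID_Start followed by XID_Continue characters, so prefixing a known-good
start character to a candidate and calling str.isidentifier() tests that
candidate for XID_Continue. The names is_xid_continue and match_pp_number
are hypothetical, not part of pcpp, and the result depends on the Unicode
version bundled with the Python interpreter, which may differ from the
one a given C++ standard references.

```python
import string

# ASCII-only sets: the grammar's "digit" and "nondigit" are not Unicode-wide.
ASCII_DIGITS = string.digits
AFTER_QUOTE = string.ascii_letters + string.digits + "_"


def is_xid_continue(ch: str) -> bool:
    """True if the single character ch has the XID_Continue property.

    "a" is a valid identifier start, so the combined string is a valid
    Python identifier exactly when ch is XID_Continue.
    """
    return ("a" + ch).isidentifier()


def match_pp_number(text: str, pos: int = 0) -> int:
    """Return the end index of a pp-number starting at pos, or -1 if none."""
    i, n = pos, len(text)
    if i < n and text[i] == ".":
        i += 1
    if i >= n or text[i] not in ASCII_DIGITS:
        return -1
    i += 1
    while i < n:
        ch = text[i]
        # Check exponent+sign first so "e+" is consumed as a unit rather
        # than "e" being taken alone as an identifier-continue character.
        if ch in "eEpP" and i + 1 < n and text[i + 1] in "+-":
            i += 2
        elif ch == "'" and i + 1 < n and text[i + 1] in AFTER_QUOTE:
            i += 2
        elif ch == "." or is_xid_continue(ch):
            i += 1
        else:
            break
    return i
```

For example, match_pp_number("x = 42;", 4) returns 6, the index just past
"42". The third-party regex package on PyPI supports \p{...} property
syntax and may be an alternative if an extra dependency is acceptable.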

Clang is also going to allow <https://reviews.llvm.org/D137051> the
mathematical profile specified in L2/22-230
<https://www.unicode.org/L2/L2022/22230-math-profile.pdf> in
identifiers. C++ may eventually allow them as well.

> Any other pp-token that should be considered for unicodification?

Some other tokens incorporate an identifier; /user-defined-literal/
<http://eel.is/c++draft/lex.ext#nt:user-defined-literal> is one example.

> Thanks for any pointers.
> --------
> Bonus follow on preprocessing question (not unicode related):
> For a pure preprocessor, divorced from a C or C++ compiler,
> is there any need for cpp-integer and cpp-float tokens?

Yes; in conditional inclusion directives, but only for
/integer-literal/; /floating-point-literal/ values cannot be queried in
such directives.
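
To illustrate the distinction, here is a rough sketch of how a
preprocessor might classify a pp-number spelling when evaluating a
conditional inclusion directive: spellings that form an /integer-literal/
are converted to a value, and everything else (including floating-point
spellings) is rejected. The helper name pp_number_to_int is hypothetical
and this is not how pcpp does it; the pattern follows the C++
/integer-literal/ grammar (digit separators, u/l/ll/z suffixes) and
ignores the intmax_t/uintmax_t range rules.

```python
import re

# Optional integer-suffix: unsigned with optional long/long long/size,
# or long/long long/size with optional unsigned.
_SUFFIX = r"(?:[uU](?:ll?|LL?|[zZ])?|(?:ll?|LL?|[zZ])[uU]?)?"

_INT_LITERAL = re.compile(
    r"(?:0[xX](?P<hex>[0-9a-fA-F](?:'?[0-9a-fA-F])*)"
    r"|0[bB](?P<bin>[01](?:'?[01])*)"
    r"|(?P<oct>0(?:'?[0-7])*)"
    r"|(?P<dec>[1-9](?:'?[0-9])*))"
    + _SUFFIX
)


def pp_number_to_int(token: str):
    """Return the value of token if it spells an integer-literal, else None."""
    m = _INT_LITERAL.fullmatch(token)
    if m is None:
        return None  # e.g. "1.5" or "1e5": not usable in #if
    for group, base in (("hex", 16), ("bin", 2), ("oct", 8), ("dec", 10)):
        digits = m.group(group)
        if digits is not None:
            # Drop digit separators before conversion.
            return int(digits.replace("'", ""), base)
    return None
```

With this sketch, pp_number_to_int("0x1F") gives 31, while
pp_number_to_int("1.5") gives None.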


> My PR fix entirely removes their CPP_INTEGER and CPP_FLOAT
> tokens, which appear to be superfluous in a pure preprocessor up to
> phase 4; all tests pass and the PR fixes my issue and another issue.
> The evaluator for #if conditional expressions doesn't do cpp-tokens
> tokenization, which would've been the only place the CPP_ tokens
> might have been needed, I think...

Received on 2022-12-14 16:50:37