If you run the preprocessor and preserve the universal-character-name conversion, all the characters you see will be in the basic source character set, and non-identifiers will match the regex. Source code written directly in Unicode is becoming far more portable. Regex engines that don't support Unicode classifications are going to become less useful.
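To make that concrete, here is a rough sketch in Python. It is only an illustration: Python's `re` has no POSIX bracket classes, so `IDENT` is my approximation of '[_[:alpha:]][_[:alnum:]]*', and the `caf\u00e9` declaration and the `find_identifiers` helper are made up for the example. It compares what the regex finds when the universal-character-name is left in the text, when the character is written directly, and when the engine only knows ASCII classes.

```python
import re

# Approximation of '[_[:alpha:]][_[:alnum:]]*' (an assumption, since Python's
# re module has no POSIX classes): one word character that is not a digit,
# followed by any number of word characters.
IDENT = r"[^\W\d]\w*"

def find_identifiers(text, ascii_only=False):
    """Return every non-overlapping match of IDENT in text."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(IDENT, text, flags)

# Raw string: the text really contains a backslash, 'u', and four hex digits.
ucn_form = r"int caf\u00e9 = 1;"
# Non-raw string: Python turns the escape into the single character U+00E9.
direct_form = "int caf\u00e9 = 1;"

print(find_identifiers(ucn_form))                      # ['int', 'caf', 'u00e9']
print(find_identifiers(direct_form))                   # ['int', 'café']
print(find_identifiers(direct_form, ascii_only=True))  # ['int', 'caf']
```

With the escape preserved, the regex reports `caf` and `u00e9`, neither of which is the identifier. With the character written directly, a Unicode-aware engine gets the whole name, while an ASCII-only one stops at `caf`.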

On Thu, Jun 18, 2020 at 11:26 AM JF Bastien via Ext <ext@lists.isocpp.org> wrote:

On Thu, Jun 18, 2020 at 8:24 AM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Thu, 18 Jun 2020 at 17:08, Matthew Woehlke <mwoehlke.floss@gmail.com> wrote:
On 18/06/2020 10.46, JF Bastien wrote:
> On Thu, Jun 18, 2020 at 7:44 AM Tom Honermann wrote:
>> On 6/18/20 10:33 AM, Matthew Woehlke via Ext wrote:
>>> Okay, maybe not, but then I suppose my point is that if we're going to fix
>>> it, I would like to *fix* it, not just make it less broken.
>> What particular form of "*fix*" do you have in mind?

I believe I already explained that. To repeat, make identifiers conform
to '[_[:alpha:]][_[:alnum:]]*'.

> I'd like to understand what is "broken" first :-)
> Escaping characters?
> Or something about tools which try to naively process C++ code? i.e. are we
> trying to make naive tools easier?

That depends on your definition of "easier". The goal isn't so much to
make it easier to write a tool correctly, but to make it so that
*existing* tools¹ are correct w.r.t. the standard.

Note that "tools" here includes humans. At least for me, the above
definition is muscle memory (and also very, very easy to type; usually
as '\w+', ignoring that this will catch stuff like '9to5' since such
false positives are rare).
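
A small Python sketch of the difference between the two forms, for illustration only. `LOOSE` and `STRICT` are names invented here, and `STRICT` is an approximation of '[_[:alpha:]][_[:alnum:]]*' because Python's `re` lacks POSIX bracket classes.

```python
import re

LOOSE = re.compile(r"\w+")          # the quick, easy-to-type form
STRICT = re.compile(r"[^\W\d]\w*")  # approximates '[_[:alpha:]][_[:alnum:]]*'

def is_match(pattern, token):
    """True if the whole token matches the pattern."""
    return pattern.fullmatch(token) is not None

for token in ["foo_bar", "_x1", "9to5"]:
    # Columns: token, accepted by LOOSE, accepted by STRICT
    print(token, is_match(LOOSE, token), is_match(STRICT, token))
```

Only '9to5' is treated differently: the loose form accepts it and the strict form rejects it, because of the leading digit.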

The alternative is to convince every text editor, text tool² and text
processing library in existence that '\w' is '\p{XID_Continue}' and not
'[_[:alnum:]]' as it is currently defined (by, AFAIK, *everyone*).

I would challenge anyone to show me an existing tool³ which uses the
proposed definition of identifiers. I can name a good half dozen, just
off the top of my head, that use *my* proposed definition.

I'm puzzled by your use case. How often do you use a regex to find identifiers?
And which tools do that?

FWIW, you have to run the preprocessor before running the regex.

(¹ I'll assume use of a Unicode-correct definition of '[[:alnum:]]'. For
tools that get that wrong, I'm happy to label the tool "broken".)

(² *cough*grep*cough*)

(³ Given the paper, it would seem like even compilers probably don't use
the proposal, but anyway, name some non-compiler tools...)

Ext mailing list
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/ext
Link to this post: http://lists.isocpp.org/ext/2020/06/14268.php