std-proposals

Re: Enhancement of std::regex

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 30 Jul 2019 14:33:00 +0000

Nozomu Katō:
> The proposed syntax option does not support char or wchar_t.

It replaces char and wchar_t with charN_t, which is wrong. "char"
usually means a character in the execution character set. Unicode doesn't
really have a concept of a character. The closest thing is a grapheme
cluster, which is composed of one or more scalar values. Therefore, working
with Unicode requires different logic.

> std::regex of C++ references RegExp of ECMAScript since the beginning
> and my proposal is intended to import the new features of RegExp.

std::regex is broken and there is no good way to fix it without a redesign.

> As of
> now RegExp does not support anything beyond the code point, such as the
> Unicode normalization and the grapheme clusters.

Which makes it pretty useless then. Code points carry little semantics
on their own in the general case. By the way, what is the exact standard
name of RegExp?

> If RegExp supports such features someday, I or someone might try to
> re-enhance std::regex in accordance with a revised spec of RegExp. But
> even if such features are supported, need for searching based on code
> point would remain, because it is likely to be very slow to do searching
> with doing normalization and/or considering grapheme clusters.

Searching on grapheme clusters shouldn't really be that hard. You
do a couple of queries to the UCD for every scalar value and a bit of
logic to determine whether a grapheme cluster has ended. Scalar value
access, OTOH, is an expert-only feature.
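
To illustrate the per-scalar-value lookup plus boundary logic, here is a
rough sketch. It is not a conforming implementation: gcb_of is a
hypothetical, drastically reduced stand-in for a real UCD
Grapheme_Cluster_Break query (it only knows CR, LF, C0/C1 controls, ZWJ and
the U+0300..U+036F combining marks), and is_boundary applies only the
UAX #29 rules GB3, GB4, GB5, GB9 and GB999. All names are made up for this
example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reduced subset of the Grapheme_Cluster_Break property values.
enum class Gcb { Other, CR, LF, Control, Extend, ZWJ };

// Hypothetical stand-in for a UCD query; a real one would consult the
// full GraphemeBreakProperty data instead of a few hard-coded ranges.
Gcb gcb_of(char32_t c) {
    if (c == U'\r') return Gcb::CR;
    if (c == U'\n') return Gcb::LF;
    if (c < 0x20 || (c >= 0x7F && c <= 0x9F)) return Gcb::Control;
    if (c == 0x200D) return Gcb::ZWJ;
    if (c >= 0x0300 && c <= 0x036F) return Gcb::Extend;  // combining marks
    return Gcb::Other;
}

// Returns true if a grapheme cluster ends between prev and next.
bool is_boundary(Gcb prev, Gcb next) {
    if (prev == Gcb::CR && next == Gcb::LF) return false;               // GB3
    if (prev == Gcb::CR || prev == Gcb::LF || prev == Gcb::Control)
        return true;                                                    // GB4
    if (next == Gcb::CR || next == Gcb::LF || next == Gcb::Control)
        return true;                                                    // GB5
    if (next == Gcb::Extend || next == Gcb::ZWJ) return false;          // GB9
    return true;                                                        // GB999
}

// Splits a sequence of scalar values into (approximate) grapheme clusters.
std::vector<std::u32string> split_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 0 || is_boundary(gcb_of(text[i - 1]), gcb_of(text[i])))
            clusters.emplace_back();  // start a new cluster
        clusters.back().push_back(text[i]);
    }
    return clusters;
}
```

With this sketch, U"e\u0301" (e followed by a combining acute accent) comes
out as a single cluster of two scalar values, and CR LF stays together. A
matcher working on clusters would then compare these units instead of
individual scalar values.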

Normalization is not needed at all with regex.


Received on 2019-07-30 09:35:43