Date: Mon, 23 Dec 2019 23:47:36 -0800
I may have found an issue in the regex library of the standard library.
Can you please help me check whether this is an issue and, if it is,
whether it has been reported? Thanks!
I'm looking at the latest C++ standard draft "n4842".
In 30.1 [re.general] Paragraph 2, it says the regex library should be
able to handle "char-like" types. And "char-like" types are defined in
21.1 [strings.general] Paragraph 1 to be non-array trivial
standard-layout types. That means a type such as the following:
    struct MyChar { int v; };
should be usable with the regex library. However, when we actually try
to use such a type with the regex library, we run into problems.
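As a minimal sketch of the starting point (C++17 or later assumed): the
static_asserts below check that MyChar meets the "char-like" definition
quoted above, while offering no "==" at all. The detection helper
has_equal_operator is a hypothetical name introduced only for this
illustration; it is not part of the standard library.

```cpp
#include <type_traits>
#include <utility>

struct MyChar { int v; };

// MyChar satisfies the quoted "char-like" definition:
// a non-array trivial standard-layout type.
static_assert(!std::is_array_v<MyChar>, "not an array");
static_assert(std::is_trivial_v<MyChar>, "trivial");
static_assert(std::is_standard_layout_v<MyChar>, "standard-layout");

// Hypothetical helper: detects whether "a == b" is well-formed for two
// const T objects.
template <class T, class = void>
struct has_equal_operator : std::false_type {};

template <class T>
struct has_equal_operator<
    T, std::void_t<decltype(std::declval<const T&>() ==
                            std::declval<const T&>())>>
    : std::true_type {};

// Built-in character types can be compared with ==, MyChar cannot.
static_assert(has_equal_operator<char>::value, "char has ==");
static_assert(!has_equal_operator<MyChar>::value, "MyChar has no ==");
```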
1. The standard text uses comparison symbols (==, !=, <, <=, >, >=) to
compare MyChar objects with each other.
For example:
Table 133 [tab:re.req]: The rows about "v.translate(c)" and
"v.translate_nocase(c)" use "=="
30.7 [re.traits]: Paragraph 12: "... then the result is
determined as if by: ... (c == ct.widen('_'))"
30.13 [re.grammar]: Paragraph 14.1.3: "c == d"; Paragraph 14.2:
"c1 <= c && c <= c2"
There are two possible interpretations:
1. These symbols (==, !=, <, <=, >, >=) are the comparison
operators of the type MyChar.
2. These symbols (==, !=, <, <=, >, >=) should be implemented
with std::char_traits<MyChar>::eq and
std::char_traits<MyChar>::lt, where std::char_traits<MyChar> can
be a user-provided specialization.
The first interpretation gives us these problems:
1. The type MyChar has to define comparison operators between
two MyChar objects.
2. The regex library standard text also compares strings using
these comparison operators.
For example:
Table 133 [tab:re.req]: The rows about "v.transform(F1, F2)"
and "v.transform_primary(F1, F2)"
30.13 [re.grammar]: Paragraph 14.2: the return statement
used "<="
According to the string library specification, these comparisons
are eventually implemented with
std::char_traits<MyChar>::compare, which according to the
standard is implemented with std::char_traits<MyChar>::eq and
std::char_traits<MyChar>::lt.
This makes character comparisons (which use the operators of
MyChar) inconsistent with string comparisons (which use
std::char_traits<MyChar>).
The second interpretation gives us no problems but it is not
explicitly specified by the standard and it is not how current
libraries are implemented.
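The inconsistency under the first interpretation can be sketched as
follows. Everything here is an assumption made for illustration: the
operators on MyChar compare exactly, while a deliberately minimal (and
incomplete) user-provided std::char_traits<MyChar> ignores ASCII case.
The helper fold and the functions chars_equal and strings_equal are
hypothetical names, and traits::compare is called directly instead of
going through std::basic_string to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>  // declares the primary std::char_traits template

struct MyChar { int v; };

// Interpretation 1: the operators of MyChar itself (exact comparison).
constexpr bool operator==(MyChar a, MyChar b) { return a.v == b.v; }
constexpr bool operator<(MyChar a, MyChar b) { return a.v < b.v; }

// Hypothetical helper: folds 'A'..'Z' to lower case (ASCII assumed).
constexpr int fold(int v) {
    return (v >= 'A' && v <= 'Z') ? v - 'A' + 'a' : v;
}

namespace std {
// Minimal, incomplete specialization: only the members used below.
template <>
struct char_traits<MyChar> {
    using char_type = MyChar;
    static constexpr bool eq(MyChar a, MyChar b) noexcept {
        return fold(a.v) == fold(b.v);
    }
    static constexpr bool lt(MyChar a, MyChar b) noexcept {
        return fold(a.v) < fold(b.v);
    }
    // Lexicographic comparison built from lt, as the traits table requires.
    static int compare(const MyChar* a, const MyChar* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};
}  // namespace std

// Character comparison as the regex wording reads under interpretation 1.
bool chars_equal(MyChar a, MyChar b) { return a == b; }

// String comparison as the string library specifies it (via traits).
bool strings_equal(const MyChar* a, const MyChar* b, std::size_t n) {
    return std::char_traits<MyChar>::compare(a, b, n) == 0;
}

void demo() {
    MyChar upper{'A'}, lower{'a'};
    assert(!chars_equal(upper, lower));        // operators: different
    assert(strings_equal(&upper, &lower, 1));  // traits: equal
}
```

With such a specialization, the one-character strings "A" and "a"
compare equal while the characters themselves do not.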
2. The standard text uses the equality symbol (==) to compare MyChar
with integer 0.
It is at:
Table 133 [tab:re.req]: The row about "X::length(p)": "Yields the
smallest i such that p[i] == 0"
There are two possible interpretations:
1. "==" is a comparison operator between type MyChar and type
int.
2. "== 0" should be implemented with
std::char_traits<MyChar>::eq(c, MyChar()), where
std::char_traits<MyChar> can be a user-provided specialization.
The first interpretation gives us these problems:
1. The type MyChar has to have "==" between it and type int
defined.
2. In the regex library, the regex traits class requirement (30.3
[re.req]) says "X::length(p)" should "yield the smallest i such
that p[i] == 0". However, in the description of the
"std::regex_traits" class template (30.7 [re.traits]), the
"length(p)" function is defined to be:
"char_traits<MyChar>::length(p)", which according to the standard
is defined as "the smallest i such that
std::char_traits<MyChar>::eq(p[i], MyChar())". And 30.3
[re.req] says that the class template defined in 30.7 [re.traits]
satisfies its requirements. This means the standard assumes that
"c == 0" and "std::char_traits<MyChar>::eq(c, MyChar())" are
equivalent, which is not guaranteed to be true.
The second interpretation gives us no problems but it is not
explicitly specified by the standard.
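A minimal sketch of how the two definitions of length can disagree. The
assumptions are loud ones: MyChar gets a hypothetical "==" against int,
and the user-provided std::char_traits<MyChar> (again incomplete, only
eq and length) treats just the low 16 bits as significant. The function
length_by_operator is a hypothetical name for the [re.req] wording.

```cpp
#include <cassert>
#include <cstddef>
#include <string>  // declares the primary std::char_traits template

struct MyChar { int v; };

// Interpretation 1: "p[i] == 0" uses an operator== between MyChar and int.
constexpr bool operator==(MyChar c, int i) { return c.v == i; }

namespace std {
// Minimal, incomplete specialization: only eq and length are provided.
template <>
struct char_traits<MyChar> {
    using char_type = MyChar;
    // Hypothetical choice: only the low 16 bits take part in comparison.
    static constexpr bool eq(MyChar a, MyChar b) noexcept {
        return (a.v & 0xFFFF) == (b.v & 0xFFFF);
    }
    // Smallest i such that eq(p[i], MyChar()) is true.
    static std::size_t length(const MyChar* p) {
        std::size_t i = 0;
        while (!eq(p[i], MyChar())) ++i;
        return i;
    }
};
}  // namespace std

// Smallest i such that p[i] == 0, following the [re.req] table wording.
std::size_t length_by_operator(const MyChar* p) {
    std::size_t i = 0;
    while (!(p[i] == 0)) ++i;
    return i;
}

void demo() {
    const MyChar s[] = {{'a'}, {0x10000}, {'b'}, {0}};
    assert(length_by_operator(s) == 3);               // stops at {0}
    assert(std::char_traits<MyChar>::length(s) == 1); // stops at {0x10000}
}
```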
3. MyChar has to have a way to be compared with, or converted to/from,
other character types; otherwise it would not be possible to recognize
regex syntactic characters such as '*', '+', '\n', L'\u2028',
L'\u2029'...
30.13 [re.grammar] Paragraph 2 says that the regex traits
template parameter, as defined in 30.3 [re.req], should provide
localization, and that basic_regex member functions shall not call
any locale-dependent C or C++ API. Instead they shall call the
appropriate traits member function.
However, the requirements on the regex traits class (30.3 [re.req])
do not provide any means to convert between character types. The
"locale_type" is only required to be "copy constructible" and
doesn't need to be std::locale, so things like
"use_facet<ctype<MyChar>>(getloc()).widen" or
"use_facet<codecvt<MyChar, char, std::mbstate_t>>(getloc())" may
not work.
There is no easy way to fix this problem. The simplest way is to
directly cast between MyChar and other character types. However,
this gives us the following problems:
1. The type MyChar should be able to cast to/from other character
types.
2. This may cause inconsistent results when MyChar objects are
compared directly with each other and with 0, and when MyChar
objects are first converted to other character types and then
compared.
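The second problem can be sketched with two ways of recognizing the
syntactic character '*'. Both functions are hypothetical names invented
for this illustration, and the example assumes an 8-bit unsigned char.
One converts the literal to the value space of MyChar before comparing;
the other converts MyChar down to a narrow character first.

```cpp
#include <cassert>

struct MyChar { int v; };

// Convert the literal '*' up to MyChar's value type, then compare.
constexpr bool is_star_widened(MyChar c) {
    return c.v == static_cast<int>('*');
}

// Convert MyChar down to a narrow character, then compare.
// The conversion to unsigned char discards the high bits.
constexpr bool is_star_narrowed(MyChar c) {
    return static_cast<unsigned char>(c.v) ==
           static_cast<unsigned char>('*');
}

void demo() {
    MyChar star{'*'};
    MyChar other{0x100 + '*'};  // differs from '*' only above bit 7

    // Both approaches agree on a plain '*'.
    assert(is_star_widened(star) && is_star_narrowed(star));

    // They disagree on a value outside the narrow character range.
    assert(!is_star_widened(other));
    assert(is_star_narrowed(other));
}
```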
Proposed solution:
In 30.1 [re.general] Paragraph 2, change "char-like template
arguments" to "template arguments that are integral types that encode
Unicode code point values, with char_traits<the type> having its
two-parameter assign, eq and lt functions provide the same results as
the =, ==, < operators, for all valid code points the type supports".
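The condition in the proposed wording could be spot-checked along these
lines. This is only a sketch for individual values, not a proof over all
code points; traits_match_operators is a hypothetical name, and char32_t
is used as an example of an integral type that encodes Unicode code
points.

```cpp
#include <cassert>
#include <string>       // std::char_traits
#include <type_traits>

// Returns true if, for the given pair of values, the two-parameter
// assign, eq and lt of std::char_traits<CharT> agree with =, == and <.
template <class CharT>
bool traits_match_operators(CharT a, CharT b) {
    static_assert(std::is_integral_v<CharT>,
                  "the proposed wording requires an integral type");
    using traits = std::char_traits<CharT>;

    CharT assigned{};
    traits::assign(assigned, a);  // two-parameter assign

    return assigned == a &&
           traits::eq(a, b) == (a == b) &&
           traits::lt(a, b) == (a < b);
}

void demo() {
    assert(traits_match_operators<char32_t>(U'a', U'b'));
    assert(traits_match_operators<char32_t>(U'*', U'*'));
    assert(traits_match_operators<char32_t>(U'\u2029', U'\u2028'));
}
```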
Received on 2019-12-24 01:50:16