std-discussion

Problem of the type requirement of the regex library

From: Xie He <hexie3605_at_[hidden]>
Date: Mon, 23 Dec 2019 23:47:36 -0800
I may have found an issue in the regex library of the standard library.
Can you please help me check whether this is an issue and, if it is,
whether it has been reported? Thanks!

I'm looking at the latest C++ standard draft "n4842".

In 30.1 [re.general] Paragraph 2, it says the regex library should be
able to handle "char-like" types. And "char-like" types are defined in
21.1 [strings.general] Paragraph 1 to be non-array trivial
standard-layout types. That means types such as the following:

    struct MyChar { int v; };

should be usable with the regex library. However, when we actually
try to use such types with the regex library, problems arise.
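
To make the starting point concrete, here is a minimal sketch (C++14) that checks the "char-like" conditions quoted above against MyChar. The helpers is_char_like and has_equality are hypothetical names introduced only for this illustration; they are not part of the standard library.

```cpp
#include <type_traits>
#include <utility>

// The example type from the text.
struct MyChar { int v; };

// Hypothetical helper: true when the expression `T == T` compiles.
template <class T, class = void>
struct has_equality : std::false_type {};

template <class T>
struct has_equality<
    T, decltype(void(std::declval<const T&>() == std::declval<const T&>()))>
    : std::true_type {};

// Hypothetical helper mirroring the wording of [strings.general]:
// a non-array, trivial, standard-layout type.
template <class T>
struct is_char_like
    : std::integral_constant<bool, !std::is_array<T>::value &&
                                       std::is_trivial<T>::value &&
                                       std::is_standard_layout<T>::value> {};

static_assert(is_char_like<MyChar>::value, "MyChar qualifies as char-like");
static_assert(!has_equality<MyChar>::value, "but MyChar has no operator==");
static_assert(has_equality<char>::value, "built-in character types do");
```

The assertions pass because the char-like definition says nothing about comparison operators, which is where the problems below begin.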

1. The standard text uses comparison symbols (==, !=, <, <=, >, >=) to
compare MyChar objects with each other.

   For example:

     Table 133 [tab:re.req]: The rows about "v.translate(c)" and
     "v.translate_nocase(c)" use "=="

     30.7 [re.traits]: Paragraph 12: "... then the result is
     determined as if by: ... (c == ct.widen('_'))"

     30.13 [re.grammar]: Paragraph 14.1.3: "c == d"; Paragraph 14.2:
     "c1 <= c && c <= c2"

   There are two possible interpretations:

     1. These symbols (==, !=, <, <=, >, >=) are the comparison
     operators of the type MyChar.

     2. These symbols (==, !=, <, <=, >, >=) should be implemented
     with std::char_traits<MyChar>::eq and
     std::char_traits<MyChar>::lt, where std::char_traits<MyChar> can
     be a user-provided specialization.

   The first interpretation gives us these problems:

     1. The type MyChar has to define comparison operators that take
     two MyChar operands.

     2. The regex library standard text also compares strings using
     these comparison operators.

        For example:

          Table 133 [tab:re.req]: The rows about "v.transform(F1, F2)"
          and "v.transform_primary(F1, F2)"

          30.13 [re.grammar]: Paragraph 14.2: the return statement
          uses "<="

        According to the string library specification, these string
        comparisons are ultimately implemented with
        std::char_traits<MyChar>::compare, which according to the
        standard is implemented with std::char_traits<MyChar>::eq and
        std::char_traits<MyChar>::lt.

        This causes inconsistency between character comparisons and
        string comparisons.

   The second interpretation gives us no problems but it is not
   explicitly specified by the standard and it is not how current
   libraries are implemented.
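
The inconsistency between character and string comparisons can be sketched as follows (C++14, assuming an ASCII-compatible execution character set). MyCharTraits is a hypothetical stand-in for a user-provided std::char_traits<MyChar> specialization, reduced to the three members needed here; the case-folding rule and the operators given to MyChar are assumptions chosen only to make the two interpretations disagree.

```cpp
#include <cstddef>

// Hypothetical character type whose own operators compare the raw value.
struct MyChar { int v; };
constexpr bool operator==(MyChar a, MyChar b) { return a.v == b.v; }
constexpr bool operator<(MyChar a, MyChar b) { return a.v < b.v; }

// Hypothetical stand-in for a user-provided std::char_traits<MyChar>.
// It ignores ASCII case, so eq/lt disagree with the operators above.
struct MyCharTraits {
    static constexpr int fold(int v) {
        return (v >= 'A' && v <= 'Z') ? v - 'A' + 'a' : v;
    }
    static constexpr bool eq(MyChar a, MyChar b) { return fold(a.v) == fold(b.v); }
    static constexpr bool lt(MyChar a, MyChar b) { return fold(a.v) < fold(b.v); }
    // Same contract as char_traits::compare: lexicographic, built on lt.
    static constexpr int compare(const MyChar* a, const MyChar* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};

// Interpretation 1 applied to characters: use MyChar's own operator==.
constexpr bool chars_equal(const MyChar* a, const MyChar* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (!(a[i] == b[i])) return false;
    return true;
}

constexpr MyChar upper[] = {{'A'}, {'B'}};
constexpr MyChar lower[] = {{'a'}, {'b'}};

// Character by character the sequences differ ...
static_assert(!chars_equal(upper, lower, 2), "operator== sees a difference");
// ... yet a string comparison routed through the traits calls them equal.
static_assert(MyCharTraits::compare(upper, lower, 2) == 0, "traits see none");
```

Under the second interpretation both checks would go through MyCharTraits and agree with each other.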

2. The standard text uses the equality symbol (==) to compare MyChar
with integer 0.

   It is at:

     Table 133 [tab:re.req]: The row about "X::length(p)": "Yields the
     smallest i such that p[i] == 0"

   There are two possible interpretations:

     1. "==" is a comparison operator between type MyChar and type
     int.

     2. "== 0" should be implemented with
     std::char_traits<MyChar>::eq(c, MyChar()), where
     std::char_traits<MyChar> can be a user-provided specialization.

   The first interpretation gives us these problems:

     1. The type MyChar has to have "==" between it and type int
     defined.

     2. In the regex library, the regex traits class requirement (30.3
     [re.req]) says "X::length(p)" should "yield the smallest i such
     that p[i] == 0". However, in the description of the
     "std::regex_traits" class template (30.7 [re.traits]), the
     "length(p)" function is defined to be:
     "char_traits<MyChar>::length(p)", which according to the standard
     is defined as "the smallest i such that
     std::char_traits<MyChar>::eq(p[i], MyChar())". And in 30.3
     [re.req] it says the class template defined in (30.7 [re.traits])
     satisfies its requirements. This means the standard thinks that
     "c == 0" and "std::char_traits<MyChar>::eq(c, MyChar())" are
     equivalent to each other, which is not guaranteed to be true.

   The second interpretation gives us no problems but it is not
   explicitly specified by the standard.
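
A minimal sketch of how the two readings of "length" can diverge (C++14). Again MyCharTraits is a hypothetical stand-in for a user-provided std::char_traits<MyChar>, and its rule of comparing only the low 8 bits is an assumption made purely for illustration.

```cpp
#include <cstddef>

struct MyChar { int v; };

// Needed only by interpretation 1: a comparison between MyChar and int.
constexpr bool operator==(MyChar c, int i) { return c.v == i; }

// Hypothetical stand-in for a user-provided std::char_traits<MyChar>
// that looks only at the low 8 bits of the value.
struct MyCharTraits {
    static constexpr bool eq(MyChar a, MyChar b) {
        return (a.v & 0xFF) == (b.v & 0xFF);
    }
};

// Interpretation 1: "smallest i such that p[i] == 0".
constexpr std::size_t length_by_zero(const MyChar* p) {
    std::size_t i = 0;
    while (!(p[i] == 0)) ++i;
    return i;
}

// Interpretation 2: "smallest i such that eq(p[i], MyChar())".
constexpr std::size_t length_by_traits(const MyChar* p) {
    std::size_t i = 0;
    while (!MyCharTraits::eq(p[i], MyChar())) ++i;
    return i;
}

// 0x100 is nonzero as an int, but its low byte is zero.
constexpr MyChar text[] = {{'a'}, {0x100}, {'b'}, {0}};

static_assert(length_by_zero(text) == 3, "stops at the element with v == 0");
static_assert(length_by_traits(text) == 1, "stops at the element eq to MyChar()");
```

The same buffer has two different lengths, so a traits class cannot satisfy both wordings at once unless eq(c, MyChar()) and c == 0 happen to agree.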

3. MyChar has to have a way to compare with or convert to/from other
character types, otherwise it would not be possible to recognize regex
syntax characters, such as '*', '+', '\n', L'\u2028',
L'\u2029'...

   In 30.13 [re.grammar] Paragraph 2, it says the regex traits
   template parameter as defined in 30.3 [re.req] should provide
   localization and basic_regex member functions shall not call any
   locale-dependent C or C++ API. Instead they shall call the
   appropriate traits member function.

   However, the requirements of the regex traits class (30.3 [re.req])
   don't provide any means to convert between character types. The
   "locale_type" is only required to be "copy constructible" and
   doesn't need to be std::locale, so things like
   "use_facet<ctype<MyChar>>(getloc()).widen" or
   "use_facet<codecvt<MyChar, char, std::mbstate_t>>(getloc())" may
   not work.

   There is no easy way to fix this problem. The simplest approach is
   to cast directly between MyChar and other character types. However,
   this gives us the following problems.

     1. The type MyChar should be able to cast to/from other character
     types.

     2. Results may be inconsistent: comparing MyChar objects directly
     with each other or with 0 can give a different answer than first
     converting them to another character type and then comparing.
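
The cast-based inconsistency can be sketched like this (C++14). The functions from_char and to_narrow are hypothetical conversions invented for the example; the regex traits requirements name no such hooks, which is the point of this item.

```cpp
struct MyChar { int v; };
constexpr bool operator==(MyChar a, MyChar b) { return a.v == b.v; }

// Hypothetical conversions between MyChar and narrow characters.
constexpr MyChar from_char(char c) {
    return MyChar{static_cast<unsigned char>(c)};
}
constexpr unsigned char to_narrow(MyChar c) {
    return static_cast<unsigned char>(c.v);  // keeps only the low byte
}

// Strategy A: convert the syntax character to MyChar, then compare MyChars.
constexpr bool is_star_widen(MyChar c) { return c == from_char('*'); }

// Strategy B: convert MyChar to a narrow character, then compare.
constexpr bool is_star_narrow(MyChar c) {
    return to_narrow(c) == static_cast<unsigned char>('*');
}

// Both strategies agree on a plain '*' ...
static_assert(is_star_widen(MyChar{'*'}) && is_star_narrow(MyChar{'*'}),
              "plain '*' is recognized either way");
// ... but disagree on a value whose low byte merely matches '*'.
static_assert(!is_star_widen(MyChar{0x100 + '*'}), "A: not a star");
static_assert(is_star_narrow(MyChar{0x100 + '*'}), "B: treated as a star");
```

Which strategy an implementation picks is not specified, so two conforming implementations could parse the same MyChar pattern differently.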

Proposed solution:

In 30.1 [re.general] Paragraph 2, change "char-like template
arguments" to "template arguments that are integral types that encode
Unicode code point values, with char_traits<the type> having its
two-parameter assign, eq and lt functions produce the same results as
the =, ==, < operators, for all valid code points the type supports".
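
As a rough illustration of the condition in the proposed wording, the sketch below (C++14) checks, for one pair of values, that std::char_traits<CharT>::eq and lt agree with the built-in == and < operators. The function traits_agree is a hypothetical helper, not an existing library facility, and it only spot-checks values rather than proving the property for a whole type.

```cpp
#include <string>
#include <type_traits>

// Hypothetical check: do char_traits<CharT>::eq/lt agree with == and <
// for the given pair of values?
template <class CharT>
constexpr bool traits_agree(CharT a, CharT b) {
    static_assert(std::is_integral<CharT>::value,
                  "the proposed wording restricts CharT to integral types");
    return std::char_traits<CharT>::eq(a, b) == (a == b) &&
           std::char_traits<CharT>::lt(a, b) == (a < b);
}

static_assert(traits_agree<char32_t>(U'a', U'b'), "char32_t agrees");
static_assert(traits_agree<char32_t>(U'\u2028', U'\u2028'), "equal values agree");
static_assert(traits_agree<wchar_t>(L'*', L'+'), "wchar_t agrees");
static_assert(traits_agree<char>('a', 'b'), "char agrees for these values");
```

Note that std::char_traits<char>::lt is specified to compare as unsigned char, so on platforms where plain char is signed the agreement may not hold for values above 0x7F; the phrase "for all valid code points the type supports" in the proposed wording would have to cover that case.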

Received on 2019-12-24 01:50:16