ISOCPP sg16 List: Further thoughts on LWG #3767 (codecvt<charN_t, char8_t, mbstate

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 10 Oct 2022 01:15:28 -0400

While writing the meeting summary for the 2022-09-28 telecon
<https://github.com/sg16-unicode/sg16-meetings#september-28th-2022>, it
occurred to me that the poll taken for LWG #3767 (codecvt<charN_t,
char8_t, mbstate_t> incorrectly added to locale
<https://cplusplus.github.io/LWG/issue3767>) was not completely clear
regarding our intent. The poll (for which we had unanimous consent) was:

*Poll: SG16 agrees that the codecvt facets mentioned in LWG3767
"codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are
intended to be invariant with respect to locale.*

Our discussion included the codecvt<charN_t, char, ...> facets in
addition to the char8_t ones but the LWG issue does not mention the
former. The wording of the poll suggests that we only intended it to
apply to the char8_t based facets, but it isn't clear to me whether that
was intended. Regardless, if our conclusion is that the char8_t-based
facets need not be present, then we should presumably take a position on
whether the deprecated facets should be un-deprecated.

Neither the poll nor the discussion is explicit regarding why we
consider these facets to be invariant with respect to locale. I think
there are at least two reasonable interpretations.

1. The codecvt<charN_t, char8_t, ...> facets are invariant with respect
    to locale because the charN_t types are only intended to be used for
    UTF-N code units.
2. The codecvt<charN_t, char8_t, ...> facets are invariant with respect
    to locale because the conversions specified in
    [locale.codecvt.general]p3
    <http://eel.is/c++draft/locale.codecvt#general-3> directly specify
    the encodings involved without deference to locale.

I believe the first interpretation reflects the desired intent.

The second interpretation is problematic. If the rationale is based on
the specified behavior of the codecvt specializations (as distinct from
the facets), then the codecvt<char, char, ...> specialization (which
performs no conversion) and the codecvt<charN_t, char, ...>
specializations (which convert between UTF encodings) are also locale
invariant. That seems clearly not to be the intent.
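
To make the specialization behavior concrete, here is a minimal sketch (an illustration only, not proposed wording). The helper names char_facet_is_noconv and decode_first_utf8 are hypothetical. It assumes an implementation that still provides the codecvt<char32_t, char, mbstate_t> facet, which is deprecated as of C++20 and may produce a deprecation warning. Only the classic "C" locale is exercised because other named locales may not be installed.

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>

// codecvt<char, char, mbstate_t> performs no conversion, so
// always_noconv() reports true for whatever locale is passed in.
inline bool char_facet_is_noconv(const std::locale& loc) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<facet_t>(loc).always_noconv();
}

// codecvt<char32_t, char, mbstate_t> is specified to convert between
// UTF-32 and UTF-8; nothing in that specification defers to the
// encoding of the locale the facet was obtained from.
// Returns the first decoded code point, or U+0000 if nothing was decoded.
inline char32_t decode_first_utf8(const std::locale& loc,
                                  const char* first, const char* last) {
    using facet_t = std::codecvt<char32_t, char, std::mbstate_t>;
    const facet_t& f = std::use_facet<facet_t>(loc);
    std::mbstate_t state{};
    const char* from_next = first;
    char32_t out[1] = {};
    char32_t* to_next = out;
    f.in(state, first, last, from_next, out, out + 1, to_next);
    return to_next == out ? char32_t(0) : out[0];
}
```

Calling decode_first_utf8(std::locale::classic(), s, s + 2) with s holding the bytes 0xC3 0xA9 should yield U+00E9 even though the "C" locale says nothing about UTF-8, which is the sense in which the specialization's behavior is locale invariant.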

I now believe that I was in error when writing P0482 (char8_t: A type
for UTF-8 characters and strings) <https://wg21.link/p0482> in the
following three ways (all based on a mistaken focus on the behavior of
the specializations as opposed to their use as facets):

1. The codecvt<charN_t, char8_t, ...> facets should not have been added
    (based on the rationale for interpretation #1 above).
2. The codecvt<charN_t, char, ...> facets should not have been
    deprecated. The rationale used for their deprecation was that the
    char8_t-based specializations perform the same conversions and that
    the char8_t specializations should be preferred. But this rationale
    failed to appreciate how facet selection occurs (by type, not by
    desired encoding).
3. A codecvt<char8_t, char, ...> facet should have been added and its
    associated specialization should convert between UTF-8 and the
    current locale encoding (for consistency with the wchar_t facet) or
    not convert at all (for consistency with the charN_t facets).

Likewise for the codecvt_byname facets.
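
The remark in item 2 that facets are selected by type can be sketched as follows. This is a minimal illustration under stated assumptions: tagged_codecvt, with_tagged_facet, and max_length_via_base are hypothetical names, and the value 42 is arbitrary. A locale stores one facet per facet id, and std::use_facet names the facet type to look up; the caller has no way to ask for "the facet that converts to encoding X".

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>   // std::mbstate_t
#include <locale>

using char_codecvt = std::codecvt<char, char, std::mbstate_t>;

// Hypothetical facet. It inherits char_codecvt::id, so installing it in
// a locale replaces whatever occupied the char_codecvt slot.
struct tagged_codecvt : char_codecvt {
    explicit tagged_codecvt(std::size_t refs = 0) : char_codecvt(refs) {}
protected:
    // Arbitrary marker value so the replacement is observable.
    int do_max_length() const noexcept override { return 42; }
};

// Build a copy of the classic locale with the hypothetical facet installed.
inline std::locale with_tagged_facet() {
    return std::locale(std::locale::classic(), new tagged_codecvt);
}

// The lookup key is the facet type, nothing else.
inline int max_length_via_base(const std::locale& loc) {
    return std::use_facet<char_codecvt>(loc).max_length();
}
```

With the classic locale, max_length_via_base returns 1; with the locale from with_tagged_facet() it returns 42. Code that asks for codecvt<charN_t, char, ...> therefore keeps getting that facet regardless of whether a char8_t-based facet performing the same conversion exists.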

Based on the above, I'm inclined towards writing a paper to address the
three errors listed above and, in doing so, resolve the LWG issue.

Please share your thoughts, both with regard to your understanding of
the poll when it was taken and with regard to the suggested direction
above.

Tom.

Received on 2022-10-10 05:15:29