ISOCPP sg16 List: Re: Further thoughts on LWG #3767 (codecvt<charN_t, char8_t, mbstate

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 10 Oct 2022 09:42:17 +0200

On Mon, Oct 10, 2022, 07:15 Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> While writing the meeting summary for the 2022-09-28 telecon
> <https://github.com/sg16-unicode/sg16-meetings#september-28th-2022>, it
> occurred to me that the poll taken for LWG #3767 (codecvt<charN_t,
> char8_t, mbstate_t> incorrectly added to locale
> <https://cplusplus.github.io/LWG/issue3767>) was not completely clear
> regarding our intent. The poll (for which we had unanimous consent) was:
>
> *Poll: SG16 agrees that the codecvt facets mentioned in LWG3767
> "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are
> intended to be invariant with respect to locale.*
>
> Our discussion included the codecvt<charN_t, char, ...> facets in
> addition to the char8_t ones but the LWG issue does not mention the
> former. The wording of the poll suggests that we only intended it to apply
> to the char8_t based facets, but it isn't clear to me whether that was
> intended. Regardless, if our conclusion is that the char8_t-based facets
> need not be present, then we should presumably take a position on whether
> the deprecated facets should be un-deprecated.
>
> Neither the poll nor the discussion is explicit regarding why we consider
> these facets to be invariant with respect to locale. I think there are at
> least two reasonable interpretations.
>
> 1. The codecvt<charN_t, char8_t, ...> facets are invariant with
> respect to locale because the charN_t types are only intended to be
> used for UTF-N code units.
> 2. The codecvt<charN_t, char8_t, ...> facets are invariant with
> respect to locale because the conversions specified in
> [locale.codecvt.general]p3
> <http://eel.is/c++draft/locale.codecvt#general-3> directly specify the
> encodings involved without deference to locale.
>
> I believe the first interpretation reflects the desired intent.
>
> The second interpretation is problematic. If the rationale is based on the
> specified behavior of the codecvt specializations (as distinct from the
> facets), then the codecvt<char, char, ...> specialization (which performs
> no conversion) and the codecvt<charN_t, char, ...> specializations (which
> convert between UTF encodings) are also locale invariant. That seems
> clearly not to be the intent.
>
> I now believe that I was in error when writing P0482
> <https://wg21.link/p0482> (char8_t: A type for UTF-8 characters and
> strings) in the following three ways (all based
> on a mistaken focus on the behavior of the specializations as opposed to
> their use as facets):
>
> 1. The codecvt<charN_t, char8_t, ...> facets should not have been
> added (based on the rationale for interpretation #1 above).
> 2. The codecvt<charN_t, char, ...> facets should not have been
> deprecated. The rationale used for their deprecation was that the
> char8_t-based specializations perform the same conversions and that
> the char8_t specializations should be preferred. But this rationale
> failed to appreciate how facet selection occurs (by type, not by desired
> encoding).
> 3. A codecvt<char8_t, char, ...> facet should have been added and its
> associated specialization should convert between UTF-8 and the current
> locale encoding (for consistency with the wchar_t facet) or not
> convert at all (for consistency with the charN_t facets).
>
> Likewise for the codecvt_byname facets.
>
> Based on the above, I'm inclined towards writing a paper to address the
> three errors listed above and, in doing so, resolve the LWG issue.
>
> Please share your thoughts, both with regard to your understanding of the
> poll when it was taken and with regard to the suggested direction above.
>
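As a minimal sketch of the "by type, not by desired encoding" point in item 2 above: a codecvt facet is requested from a locale by naming its full type, so the template arguments are the only selector. The alias codecvt_facet and the helpers has_codecvt and char_facet_is_noconv are hypothetical names introduced for illustration; only the char and wchar_t facets are used so the sketch does not depend on the deprecated or char8_t-based ones.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <type_traits>

// Hypothetical alias for illustration: the facet type is fully determined
// by the internal and external character types.
template <class Intern, class Extern>
using codecvt_facet = std::codecvt<Intern, Extern, std::mbstate_t>;

// Different template arguments name different facet types, so a locale
// stores and looks them up independently of one another.
static_assert(!std::is_same<codecvt_facet<char, char>,
                            codecvt_facet<wchar_t, char>>::value,
              "distinct character types give distinct facet types");

// Hypothetical helper: true if `loc` holds the facet of exactly this type.
// Nothing in the lookup names an encoding.
template <class Intern, class Extern>
bool has_codecvt(const std::locale& loc) {
    return std::has_facet<codecvt_facet<Intern, Extern>>(loc);
}

// Hypothetical helper: reports whether the char/char facet of `loc`
// performs no conversion at all.
bool char_facet_is_noconv(const std::locale& loc) {
    assert((has_codecvt<char, char>(loc)));
    return std::use_facet<codecvt_facet<char, char>>(loc).always_noconv();
}
```

Calling has_codecvt<wchar_t, char>(std::locale::classic()) asks for the wchar_t facet purely by type; what that facet converts to or from is a property of the facet object the locale happens to hold.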

Agreed on 1, but I believe Victor's resolution gets us there.

2/3... Given our time is limited, I'd rather we focus on JeanHeyd's paper.
It's a critical facility for which usability is important.
codecvt may not be the thing we want to invest resources into (that's the
nicest thing I can say about this facility).
3 is also lossy, so not usually a desirable operation.
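
To illustrate the "lossy" point, here is a minimal sketch in which ASCII stands in for a locale encoding that cannot represent every Unicode code point. The function utf8_to_ascii_lossy is a hypothetical helper, not a standard facility, and it takes std::string rather than a char8_t string so that it does not depend on C++20. It assumes well-formed UTF-8 input.

```cpp
#include <string>

// Hypothetical helper: narrow UTF-8 to ASCII, emitting one '?' for every
// code point that ASCII cannot represent. The original text cannot be
// recovered from the result, which is what makes the conversion lossy.
std::string utf8_to_ascii_lossy(const std::string& utf8) {
    std::string out;
    for (unsigned char unit : utf8) {
        if (unit < 0x80) {
            // Single-byte (ASCII) code units pass through unchanged.
            out.push_back(static_cast<char>(unit));
        } else if ((unit & 0xC0) != 0x80) {
            // Lead byte of a multi-byte sequence: substitute once.
            out.push_back('?');
        }
        // Continuation bytes (10xxxxxx) are skipped, so each non-ASCII
        // code point yields exactly one '?'.
    }
    return out;
}
```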

Either way, I would hate for 1 to be tied to 2 and 3.

I believe 1 should be treated now so that it's part of C++23. It's an easy
fix. We'd avoid tempting Hyrum's law for 3 more years...

I have further thoughts on the conflation of locales and encodings, but it
ultimately doesn't matter here. I'm afraid we won't get past that until
serious work is done to replace std::locale (which I don't see happening
anytime soon).

> Tom.

Received on 2022-10-10 07:42:30