ISOCPP sg16 List: Re: Further thoughts on LWG #3767 (codecvt<charN_t, char8_t, mbstate

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 10 Oct 2022 18:20:53 -0400

On 10/10/22 3:42 AM, Corentin Jabot wrote:
>
>
> On Mon, Oct 10, 2022, 07:15 Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> While writing the meeting summary for the 2022-09-28 telecon
> <https://github.com/sg16-unicode/sg16-meetings#september-28th-2022>,
> it occurred to me that the poll taken for LWG #3767
> (codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
> <https://cplusplus.github.io/LWG/issue3767>) was not completely
> clear regarding our intent. The poll (for which we had unanimous
> consent) was:
>
> *Poll: SG16 agrees that the codecvt facets mentioned in LWG3767
> "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale"
> are intended to be invariant with respect to locale.*
>
> Our discussion included the codecvt<charN_t, char, ...> facets in
> addition to the char8_t ones but the LWG issue does not mention
> the former. The wording of the poll suggests that we only intended
> it to apply to the char8_t based facets, but it isn't clear to me
> whether that was intended. Regardless, if our conclusion is that
> the char8_t-based facets need not be present, then we should
> presumably take a position on whether the deprecated facets should
> be un-deprecated.
>
> Neither the poll nor the discussion is explicit regarding why we
> consider these facets to be invariant with respect to locale. I
> think there are at least two reasonable interpretations.
>
> 1. The codecvt<charN_t, char8_t, ...> facets are invariant with
> respect to locale because the charN_t types are only intended
> to be used for UTF-N code units.
> 2. The codecvt<charN_t, char8_t, ...> facets are invariant with
> respect to locale because the conversions specified in
> [locale.codecvt.general]p3
> <http://eel.is/c++draft/locale.codecvt#general-3> directly
> specify the encodings involved without deference to locale.
>
> I believe the first interpretation reflects the desired intent.
>
> The second interpretation is problematic. If the rationale is
> based on the specified behavior of the codecvt specializations (as
> distinct from the facets), then the codecvt<char, char, ...>
> specialization (which performs no conversion) and the
> codecvt<charN_t, char, ...> specializations (which convert between
> UTF encodings) are also locale invariant. That seems clearly not
> to be the intent.
>
> I now believe that I was in error when writing P0482 (char8_t: A
> type for UTF-8 characters and strings) <https://wg21.link/p0482>
> in the following three ways (all based on a mistaken focus on the
> behavior of the specializations as opposed to their use as facets):
>
> 1. The codecvt<charN_t, char8_t, ...> facets should not have been
> added (based on the rationale for interpretation #1 above).
> 2. The codecvt<charN_t, char, ...> facets should not have been
> deprecated. The rationale used for their deprecation was that
> the char8_t-based specializations perform the same conversions
> and that the char8_t specializations should be preferred. But
> this rationale failed to appreciate how facet selection occurs
> (by type, not by desired encoding).
> 3. A codecvt<char8_t, char, ...> facet should have been added and
> its associated specialization should convert between UTF-8 and
> the current locale encoding (for consistency with the wchar_t
> facet) or not convert at all (for consistency with the charN_t
> facets).
>
> Likewise for the codecvt_byname facets.
>
> Based on the above, I'm inclined towards writing a paper to
> address the three errors listed above and, in doing so, resolve
> the LWG issue.
>
> Please share your thoughts, both with regard to your understanding
> of the poll when it was taken and with regard to the suggested
> direction above.
>
>
>
>
> Agreed on 1, but I believe Victor's resolution gets us there.
Partially; codecvt_byname is not included in it; at least not yet.
>
> 2/3... Given our time is limited, I'd rather we focus on JeanHeyd's
> paper. It's a critical facility for which usability is important.
> codecvt may not be the thing we want to invest resources into (that's
> the nicest thing I can say about this facility).
Absolutely. We'll definitely prioritize JeanHeyd's work as soon as a
revised paper is made available. Unless I missed it, no new revisions
have been submitted.
> 3 is also lossy so not usually a desirable operation.
Can you elaborate? If we were to specify it to convert to the locale
encoding (like the wchar_t facet), then it would be lossy. But if we
specify it to "convert" to UTF-8 (like the other charN_t facets), then
it isn't.
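
To make the facet-selection point concrete, here is a minimal sketch of how
one of the existing (deprecated since C++20) charN_t facets is reached. The
helper name utf16_to_utf8 and the alias utf16_utf8_facet are made up for
illustration and do not come from any paper; the sketch assumes C++17 or
later and an implementation that still ships the facet. Note that the lookup
names only a type; the encodings come from the specialization's specified
behavior, not from the locale.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// The facet is identified purely by its type. Nothing here says "UTF-8";
// that comes from the specified behavior of the specialization.
using utf16_utf8_facet = std::codecvt<char16_t, char, std::mbstate_t>;

// Hypothetical helper: convert UTF-16 to UTF-8 using whichever
// utf16_utf8_facet the given locale holds.
std::string utf16_to_utf8(const std::u16string& in,
                          const std::locale& loc = std::locale::classic()) {
    if (in.empty()) {
        return {};
    }
    // Facet selection happens by type, not by desired encoding.
    const auto& facet = std::use_facet<utf16_utf8_facet>(loc);

    std::mbstate_t state{};
    // Four bytes per UTF-16 code unit is more than enough room for UTF-8.
    std::string out(in.size() * 4, '\0');
    const char16_t* from_next = nullptr;
    char* to_next = nullptr;

    const auto result = facet.out(state,
                                  in.data(), in.data() + in.size(), from_next,
                                  out.data(), out.data() + out.size(), to_next);
    if (result != std::codecvt_base::ok) {
        throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
    }
    out.resize(static_cast<std::size_t>(to_next - out.data()));
    return out;
}
```

Calling utf16_to_utf8(u"h\u00e9") should yield the bytes 'h', 0xC3, 0xA9. A
codecvt<char8_t, char, ...> facet specified in the same style would be
selected the same way and would not be lossy.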
>
> Either way, I would hate for 1 to be tied to 2 and 3.
>
> I believe 1 should be treated now so that it's part of C++23. It's an
> easy fix. We'd avoid tempting Hyrum's law for 3 more years...
That's fair. I imagine that we would adopt this as a DR though, so
getting it in for C++23 probably doesn't matter much.
>
> I have further thoughts on the conflation of locales and encodings but
> it ultimately doesn't matter here. I'm afraid we won't get past that
> until serious work is done to replace std::locale (which I don't see
> happening anytime soon).

That conflation continues to exist on all major platforms (at least
technically, not necessarily in practice for some of them). I would be
interested in your thoughts on how a replacement for std::locale would help.

Tom.

Received on 2022-10-10 22:20:55