Re: Further thoughts on LWG #3767 (codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale)

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 12 Oct 2022 10:34:09 -0700
Considering that codecvt facets are a terrible transcoding API, we should
definitely not add new ones, regardless of past mistakes. Undeprecation
seems like a bad idea for the same reason. If anything, we should be aiming
at removing those facets, not bringing them back.

I agree with Corentin that it would be nice to remove obviously wrong
codecvt<charN_t, char8_t, ...> facets in the C++23 timeframe to reduce the
chance of misguided adoption.

I don't see any problems with adjusting the proposed resolution to handle
byname facets; it was a trivial omission on my part.
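
For readers less familiar with the API, here is a minimal sketch of the point
quoted below that facet selection occurs by type, not by desired encoding. It
uses only the non-deprecated codecvt<char, char, mbstate_t> and
codecvt<wchar_t, char, mbstate_t> facets; the names NoConvFacet, WideFacet,
locale_has and narrow are hypothetical helpers introduced for illustration and
do not come from the thread or the proposed resolution.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Facets are looked up by their C++ type, not by the encoding pair wanted.
using NoConvFacet = std::codecvt<char, char, std::mbstate_t>;
using WideFacet = std::codecvt<wchar_t, char, std::mbstate_t>;

// Hypothetical helper: true if `loc` holds a facet of the requested type.
template <class Facet>
bool locale_has(const std::locale& loc) {
    return std::has_facet<Facet>(loc);
}

// Hypothetical helper: narrow a wide string through whatever wchar_t facet
// `loc` holds. The target encoding is decided by the locale, not the caller.
std::string narrow(const std::wstring& in, const std::locale& loc) {
    const auto& cvt = std::use_facet<WideFacet>(loc);
    std::mbstate_t state{};
    // max_length() bounds the external bytes produced per internal character.
    std::string out(in.size() * cvt.max_length(), '\0');
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const auto res = cvt.out(state, in.data(), in.data() + in.size(), from_next,
                             out.data(), out.data() + out.size(), to_next);
    if (res == std::codecvt_base::error) {
        return {};  // unconvertible input: report as an empty string
    }
    out.resize(static_cast<std::size_t>(to_next - out.data()));
    return out;
}
```

A call such as narrow(L"abc", std::locale::classic()) goes through the facet
that the locale supplies for the type pair wchar_t/char; the caller has no way
to ask for a particular encoding except by choosing a different facet type or
a different locale.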

Cheers,
Victor


On Mon, Oct 10, 2022 at 3:20 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> On 10/10/22 3:42 AM, Corentin Jabot wrote:
>
>
>
> On Mon, Oct 10, 2022, 07:15 Tom Honermann via SG16 <sg16_at_[hidden]>
> wrote:
>
>> While writing the meeting summary for the 2022-09-28 telecon
>> <https://github.com/sg16-unicode/sg16-meetings#september-28th-2022>, it
>> occurred to me that the poll taken for LWG #3767 (codecvt<charN_t,
>> char8_t, mbstate_t> incorrectly added to locale
>> <https://cplusplus.github.io/LWG/issue3767>) was not completely clear
>> regarding our intent. The poll (for which we had unanimous consent) was:
>>
>> *Poll: SG16 agrees that the codecvt facets mentioned in LWG3767
>> "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are
>> intended to be invariant with respect to locale.*
>>
>> Our discussion included the codecvt<charN_t, char, ...> facets in
>> addition to the char8_t ones but the LWG issue does not mention the
>> former. The wording of the poll suggests that we only intended it to apply
>> to the char8_t based facets, but it isn't clear to me whether that was
>> intended. Regardless, if our conclusion is that the char8_t-based facets
>> need not be present, then we should presumably take a position on whether
>> the deprecated facets should be un-deprecated.
>>
>> Neither the poll nor the discussion is explicit regarding why we
>> consider these facets to be invariant with respect to locale. I think there
>> are at least two reasonable interpretations.
>>
>> 1. The codecvt<charN_t, char8_t, ...> facets are invariant with
>> respect to locale because the charN_t types are only intended to be
>> used for UTF-N code units.
>> 2. The codecvt<charN_t, char8_t, ...> facets are invariant with
>> respect to locale because the conversions specified in
>> [locale.codecvt.general]p3
>> <http://eel.is/c++draft/locale.codecvt#general-3> directly specify
>> the encodings involved without deference to locale.
>>
>> I believe the first interpretation reflects the desired intent.
>>
>> The second interpretation is problematic. If the rationale is based on
>> the specified behavior of the codecvt specializations (as distinct from
>> the facets), then the codecvt<char, char, ...> specialization (which
>> performs no conversion) and the codecvt<charN_t, char, ...>
>> specializations (which convert between UTF encodings) are also locale
>> invariant. That seems clearly not to be the intent.
>>
>> I now believe that I was in error when writing P0482 (char8_t: A type
>> for UTF-8 characters and strings) <https://wg21.link/p0482> in the
>> following three ways (all based on a mistaken focus on the behavior of
>> the specializations as opposed to their use as facets):
>>
>> 1. The codecvt<charN_t, char8_t, ...> facets should not have been
>> added (based on the rationale for interpretation #1 above).
>> 2. The codecvt<charN_t, char, ...> facets should not have been
>> deprecated. The rationale used for their deprecation was that the
>> char8_t-based specializations perform the same conversions and that
>> the char8_t specializations should be preferred. But this rationale
>> failed to appreciate how facet selection occurs (by type, not by desired
>> encoding).
>> 3. A codecvt<char8_t, char, ...> facet should have been added and its
>> associated specialization should convert between UTF-8 and the current
>> locale encoding (for consistency with the wchar_t facet) or not
>> convert at all (for consistency with the charN_t facets).
>>
>> Likewise for the codecvt_byname facets.
>>
>> Based on the above, I'm inclined towards writing a paper to address the
>> three errors listed above and, in doing so, resolve the LWG issue.
>>
>> Please share your thoughts, both with regard to your understanding of the
>> poll when it was taken and with regard to the suggested direction above.
>>
>
>
>
> Agreed on 1, but I believe Victor's resolution gets us there.
>
> Partially; codecvt_byname is not included in it; at least not yet.
>
>
> 2/3... Given our time is limited, I'd rather we focus on JeanHeyd's paper.
> It's a critical facility for which usability is important.
> codecvt may not be the thing we want to invest resources into (that's
> the nicest thing I can say about this facility).
>
> Absolutely. We'll definitely prioritize JeanHeyd's work as soon as a
> revised paper is made available. Unless I missed it, no new revisions have
> been submitted.
>
> 3 is also lossy so not usually a desirable operation.
>
> Can you elaborate? If we were to specify it to convert to the locale
> encoding (like the wchar_t facet), then it would be lossy. But if we
> specify it to "convert" to UTF-8 (like the other charN_t facets), then it
> isn't.
>
>
> Either way, I would hate for 1 to be tied to 2 and 3.
>
> I believe 1 should be treated now so that it's part of C++23. It's an easy
> fix. We'd avoid tempting Hyrum's law for 3 more years...
>
> That's fair. I imagine that we would adopt this as a DR though, so getting
> it in for C++23 probably doesn't matter much.
>
>
> I have further thoughts on the conflation of locales and encodings, but it
> ultimately doesn't matter here. I'm afraid we won't get past that until
> serious work is done to replace std::locale (which I don't see happening
> anytime soon).
>
> That conflation continues to exist on all major platforms (at least
> technically, not necessarily in practice for some of them). I would be
> interested in your thoughts on how a replacement for std::locale would
> help.
>
> Tom.
>
>
> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>

Received on 2022-10-12 17:34:20