On 10/10/22 3:42 AM, Corentin Jabot wrote:


On Mon, Oct 10, 2022, 07:15 Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

While writing the meeting summary for the 2022-09-28 telecon, it occurred to me that the poll taken for LWG #3767 (codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale) was not completely clear regarding our intent. The poll (for which we had unanimous consent) was:

Poll: SG16 agrees that the codecvt facets mentioned in LWG3767 "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are intended to be invariant with respect to locale.

Our discussion included the codecvt<charN_t, char, ...> facets in addition to the char8_t ones but the LWG issue does not mention the former. The wording of the poll suggests that we only intended it to apply to the char8_t based facets, but it isn't clear to me whether that was intended. Regardless, if our conclusion is that the char8_t-based facets need not be present, then we should presumably take a position on whether the deprecated facets should be un-deprecated.

Neither the poll nor the discussion is explicit regarding why we consider these facets to be invariant with respect to locale. I think there are at least two reasonable interpretations.

  1. The codecvt<charN_t, char8_t, ...> facets are invariant with respect to locale because the charN_t types are only intended to be used for UTF-N code units.
  2. The codecvt<charN_t, char8_t, ...> facets are invariant with respect to locale because the conversions specified in [locale.codecvt.general]p3 directly specify the encodings involved without deference to locale.

I believe the first interpretation reflects the desired intent.

The second interpretation is problematic. If the rationale is based on the specified behavior of the codecvt specializations (as distinct from the facets), then the codecvt<char, char, ...> specialization (which performs no conversion) and the codecvt<charN_t, char, ...> specializations (which convert between UTF encodings) are also locale invariant. That seems clearly not to be the intent.
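
As a minimal sketch of the "performs no conversion" point for codecvt<char, char, ...>, the following queries the facet installed in a given locale. The function name is hypothetical and is used here for illustration only.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper: reports whether the codecvt<char, char, mbstate_t>
// facet installed in `loc` claims that it never converts.
bool char_to_char_never_converts(const std::locale& loc) {
    using facet_type = std::codecvt<char, char, std::mbstate_t>;
    // always_noconv() returns true when in() and out() never transform input.
    return std::use_facet<facet_type>(loc).always_noconv();
}
```

Calling it with std::locale::classic() should report true, since that specialization is specified as a degenerate, non-converting facet.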

I now believe that I was in error when writing P0482 (char8_t: A type for UTF-8 characters and strings) in the following three ways (all based on a mistaken focus on the behavior of the specializations as opposed to their use as facets):

  1. The codecvt<charN_t, char8_t, ...> facets should not have been added (based on the rationale for interpretation #1 above).
  2. The codecvt<charN_t, char, ...> facets should not have been deprecated. The rationale used for their deprecation was that the char8_t-based specializations perform the same conversions and that the char8_t specializations should be preferred. But this rationale failed to appreciate how facet selection occurs (by type, not by desired encoding).
  3. A codecvt<char8_t, char, ...> facet should have been added and its associated specialization should convert between UTF-8 and the current locale encoding (for consistency with the wchar_t facet) or not convert at all (for consistency with the charN_t facets).

Likewise for the codecvt_byname facets.
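
To illustrate the "by type, not by desired encoding" point in item 2, here is a minimal sketch of how a facet is looked up. The helper name has_codecvt is hypothetical. Only char and wchar_t are exercised so that the sketch avoids the deprecated specializations, but the same lookup applies to the charN_t facets.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper: facet lookup is keyed on the facet *type*.
// The template arguments name code unit types; nothing in the call
// names an encoding, so a caller cannot ask for "the UTF-8 conversion".
template <class InternT, class ExternT>
bool has_codecvt(const std::locale& loc) {
    return std::has_facet<std::codecvt<InternT, ExternT, std::mbstate_t>>(loc);
}
```

For example, has_codecvt<wchar_t, char>(std::locale::classic()) asks only whether a facet of that type is present. Which encodings that facet converts between is a property of the facet, not of the request.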

Based on the above, I'm inclined towards writing a paper to address the three errors listed above and, in doing so, resolve the LWG issue.

Please share your thoughts, both with regard to your understanding of the poll when it was taken and with regard to the suggested direction above.




Agreed on 1, but I believe Victor's resolution gets us there.
Partially; codecvt_byname is not included in it; at least not yet.

2/3... Given that our time is limited, I'd rather we focus on JeanHeyd's paper. It's a critical facility for which usability is important.
codecvt may not be the thing we want to invest resources into (that's the nicest thing I can say about this facility).
Absolutely. We'll definitely prioritize JeanHeyd's work as soon as a revised paper is made available. Unless I missed it, no new revisions have been submitted.
3 is also lossy, so it is not usually a desirable operation.
Can you elaborate? If we were to specify it to convert to the locale encoding (like the wchar_t facet), then it would be lossy. But if we specify it to "convert" to UTF-8 (like the other charN_t facets), then it isn't.

Either way, I would hate for 1 to be tied to 2 and 3.

I believe 1 should be treated now so that it's part of C++23. It's an easy fix. We'd avoid tempting Hyrum's law for 3 more years...
That's fair. I imagine that we would adopt this as a DR though, so getting it in for C++23 probably doesn't matter much.

I have further thoughts on the conflation of locales and encodings, but it ultimately doesn't matter here. I'm afraid we won't get past that until serious work is done to replace std::locale (which I don't see happening anytime soon).

That conflation continues to exist on all major platforms (at least technically, not necessarily in practice for some of them). I would be interested in your thoughts on how a replacement for std::locale would help.

Tom.