While writing the meeting summary for the 2022-09-28 telecon, it occurred to me that the poll taken for LWG #3767 (codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale) was not completely clear regarding our intent. The poll (for which we had unanimous consent) was:
Poll: SG16 agrees that the codecvt facets mentioned in LWG3767 "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are intended to be invariant with respect to locale.
Our discussion included the codecvt<charN_t, char, ...> facets in addition to the char8_t ones, but the LWG issue does not mention the former. The wording of the poll suggests that it applies only to the char8_t-based facets, but it isn't clear to me whether that restriction was intended. Regardless, if our conclusion is that the char8_t-based facets need not be present, then we should presumably take a position on whether the deprecated facets should be un-deprecated.
Neither the poll nor the discussion is explicit regarding why we consider these facets to be invariant with respect to locale. I think there are at least two reasonable interpretations.
- The codecvt<charN_t, char8_t, ...> facets are invariant with respect to locale because the charN_t types are only intended to be used for UTF-N code units.
- The codecvt<charN_t, char8_t, ...> facets are invariant with respect to locale because the conversions specified in [locale.codecvt.general]p3 directly specify the encodings involved without deference to locale.
I believe the first interpretation reflects the desired intent.
The second interpretation is problematic. If the rationale is based on the specified behavior of the codecvt specializations (as distinct from the facets), then the codecvt<char, char, ...> specialization (which performs no conversion) and the codecvt<charN_t, char, ...> specializations (which convert between UTF encodings) are also locale invariant. That seems clearly not to be the intent.
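To make the concern with the second interpretation concrete, here is a minimal sketch (my own illustration, not taken from the LWG issue or P0482). The helper names to_utf8 and char_facet_is_noconv are hypothetical. The sketch retrieves the codecvt<char16_t, char, mbstate_t> and codecvt<char, char, mbstate_t> facets from whatever locale is passed in and exercises the behavior specified for the specializations. Note that the former facet is deprecated as of C++20, so some implementations may warn.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-16 to UTF-8 using the
// codecvt<char16_t, char, mbstate_t> facet found in `loc`.
// The specialization is specified to convert between UTF-16 and UTF-8;
// nothing about `loc` is consulted to choose an encoding.
std::string to_utf8(const std::u16string& in, const std::locale& loc) {
    using facet_t = std::codecvt<char16_t, char, std::mbstate_t>;
    if (in.empty()) return std::string();

    const facet_t& f = std::use_facet<facet_t>(loc);
    std::mbstate_t state{};
    // A UTF-16 code unit never needs more than 3 UTF-8 bytes
    // (a surrogate pair needs 4 bytes for 2 code units); 4 is a safe bound.
    std::string out(in.size() * 4, '\0');
    const char16_t* from_next = nullptr;
    char* to_next = nullptr;

    std::codecvt_base::result r =
        f.out(state, in.data(), in.data() + in.size(), from_next,
              &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("UTF-16 to UTF-8 conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

// Hypothetical helper: reports whether the codecvt<char, char, mbstate_t>
// facet in `loc` is a no-op, as the specialization is specified to be.
bool char_facet_is_noconv(const std::locale& loc) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<facet_t>(loc).always_noconv();
}
```

Passing std::locale::classic(), std::locale(), or a named locale (assuming one is installed) to either helper should give the same result. This is exactly why a rationale based on the behavior of the specializations would sweep in these facets as well.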
I now believe that I was in error when writing P0482 (char8_t: A type for UTF-8 characters and strings) in the following three ways (all based on a mistaken focus on the behavior of the specializations as opposed to their use as facets):
- The codecvt<charN_t, char8_t, ...> facets should not have been added (based on the rationale for interpretation #1 above).
- The codecvt<charN_t, char, ...> facets should not have been deprecated. The rationale used for their deprecation was that the char8_t-based specializations perform the same conversions and that the char8_t specializations should be preferred. But this rationale failed to appreciate how facet selection occurs (by type, not by desired encoding).
- A codecvt<char8_t, char, ...> facet should have been added and its associated specialization should convert between UTF-8 and the current locale encoding (for consistency with the wchar_t facet) or not convert at all (for consistency with the charN_t facets).
Likewise for the codecvt_byname facets.
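The point about facet selection occurring by type can be sketched as follows. This is an illustration under my own assumptions; codecvt_for and distinct_facets are hypothetical names. The facet obtained from a locale is determined entirely by the internal and external character types, each facet type has its own locale::id, and there is no way to ask a locale for a facet by naming the encoding one wants.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: look up a codecvt facet from the character types alone.
// The caller cannot express "UTF-8" or "the locale encoding" here; only the
// types Intern and Extern participate in the lookup.
template <class Intern, class Extern>
const std::codecvt<Intern, Extern, std::mbstate_t>&
codecvt_for(const std::locale& loc) {
    return std::use_facet<std::codecvt<Intern, Extern, std::mbstate_t>>(loc);
}

// Hypothetical helper: true if the two facet types are keyed by different
// locale::id objects, that is, they occupy different slots in a locale.
template <class FacetA, class FacetB>
bool distinct_facets() {
    return &FacetA::id != &FacetB::id;
}
```

Code that stores UTF-8 in char therefore reaches the codecvt<char16_t, char, mbstate_t> facet via codecvt_for<char16_t, char>(loc). It cannot reach a char8_t-based facet without first changing the type of its buffers, regardless of which conversion it actually wants.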
Based on the above, I'm inclined towards writing a paper to address the three errors listed above and, in doing so, resolve the LWG issue.
Please share your thoughts, both with regard to your understanding of the poll when it was taken and with regard to the suggested direction above.
SG16 mailing list