SG16 will hold a telecon on Wednesday, September 28th, at 19:30 UTC (timezone conversion).

The agenda is:

LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
LWG #3412: §[format.string.std] references to "Unicode encoding" unclear
Handling ill-formed Unicode in the library

See prior mailing list discussion.

LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale

This issue was recently filed by Victor and once again poses a question familiar to all those who have discussed locale before: "But why?" Table [locale.category.facets] includes the following codecvt specializations in the specified set of ctype category facets. A matching set of codecvt_byname specializations is likewise present in table [locale.spec].

codecvt<char, char, mbstate_t>
codecvt<char16_t, char8_t, mbstate_t> (Added via P0482)
codecvt<char32_t, char8_t, mbstate_t> (Added via P0482)
codecvt<wchar_t, char, mbstate_t>

Those aren't all of the required facets, though; the following are included via [depr.locale.category]. They were deprecated, but not removed, by P0482.

codecvt<char16_t, char, mbstate_t>
codecvt<char32_t, char, mbstate_t>

The interesting thing that Victor points out is that, even before P0482 (char8_t: A type for UTF-8 characters and strings) and P1041 (Make char16_t/char32_t string literals be UTF-16/32), the char16_t and char32_t specializations were specified to convert between UTF-16/UTF-32 and UTF-8 and are therefore locale-independent. So why were they included as locale facets? And if there is no reason for them to be included, then surely the char8_t specializations should not be either.
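
As a rough illustration of the locale independence described above, the following sketch pulls the char32_t facet out of whatever locale the caller supplies and uses it to decode UTF-8. The helper name utf8_to_utf32 is invented for this example, and codecvt<char32_t, char, mbstate_t> is deprecated as of C++20, so a compiler may warn about it.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper (not part of any proposal): decode UTF-8 bytes to
// UTF-32 using the codecvt facet found in the given locale.
// Returns an empty string if the facet reports anything other than "ok".
std::u32string utf8_to_utf32(const std::string& bytes, const std::locale& loc) {
    using facet_t = std::codecvt<char32_t, char, std::mbstate_t>;
    if (bytes.empty()) return {};

    const facet_t& cvt = std::use_facet<facet_t>(loc);
    std::mbstate_t state{};

    // UTF-32 never needs more code units than the UTF-8 input has bytes.
    std::u32string out(bytes.size(), U'\0');
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;

    const std::codecvt_base::result r =
        cvt.in(state,
               bytes.data(), bytes.data() + bytes.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) return {};

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Since the conversion is specified as UTF-8 to UTF-32, the result should not depend on which locale the facet was obtained from, which is the heart of the "But why?" question.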

The codecvt facets are only used (within the standard) by std::basic_filebuf (see [filebuf.general]p5) and std::filesystem::path (see [fs.path.construct]p6). The former only uses the char-based specializations (to convert between its parameterized charT character type and char) and the latter only uses the wchar_t specialization. It is worth noting that, because of [locale.codecvt.virtuals]p4, std::basic_filebuf is unable to use the char16_t-based specialization (see also SG16 issue #33).
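
To make the std::basic_filebuf case concrete, here is a small sketch of the facet type that [filebuf.general] describes for a given charT. The alias name filebuf_codecvt and the function classic_locale_has_filebuf_facets are made up for illustration; only the std:: names are real.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <type_traits>

// Hypothetical alias: the codecvt specialization a basic_filebuf<CharT, Traits>
// is described as using to convert between CharT and char.
template <class CharT, class Traits = std::char_traits<CharT>>
using filebuf_codecvt = std::codecvt<CharT, char, typename Traits::state_type>;

static_assert(std::is_same<filebuf_codecvt<char>,
                           std::codecvt<char, char, std::mbstate_t>>::value,
              "filebuf uses the char/char specialization");
static_assert(std::is_same<filebuf_codecvt<wchar_t>,
                           std::codecvt<wchar_t, char, std::mbstate_t>>::value,
              "wfilebuf uses the wchar_t/char specialization");

// Both of these facets are required to be present in every locale.
bool classic_locale_has_filebuf_facets() {
    const std::locale& loc = std::locale::classic();
    return std::has_facet<filebuf_codecvt<char>>(loc) &&
           std::has_facet<filebuf_codecvt<wchar_t>>(loc);
}
```

Note that the external type is always char here, so neither char8_t-based specialization can ever be selected this way.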

One of the motivations stated for Victor's proposed resolution is to avoid the overhead of loading these facets. It would be helpful to understand 1) what the overhead cost is in practice (presumably enough for someone to have noticed it and for Victor to have reported it), and 2) whether implementors would actually change their implementations.

codecvt specializations may be used as base classes of user-defined class types that perform some kind of specialized conversion. It is therefore possible for a std::locale object to be constructed such that the facet returned by, for example, use_facet<std::codecvt<char16_t, char, mbstate_t>>(loc), implements a conversion between UTF-16 and a locale-dependent encoding. Removing the noted specializations would technically be a breaking change because it would affect the behavior of has_facet and use_facet.
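
A minimal sketch of that scenario follows. The class legacy_codecvt and the function locale_uses_legacy_facet are hypothetical; the facet only overrides a couple of property queries rather than implementing a real conversion, which is enough to show that use_facet hands back the user-defined object. The base specialization is deprecated in C++20, so expect a warning from some implementations.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical user-defined facet that claims to convert between UTF-16 and
// a single-byte, locale-dependent encoding. The conversion functions are
// left at their defaults; only the advertised properties are overridden.
struct legacy_codecvt : std::codecvt<char16_t, char, std::mbstate_t> {
    using base = std::codecvt<char16_t, char, std::mbstate_t>;
    explicit legacy_codecvt(std::size_t refs = 0) : base(refs) {}

protected:
    int do_encoding() const noexcept override { return 1; }    // fixed width: 1 byte
    int do_max_length() const noexcept override { return 1; }  // at most 1 byte per unit
};

bool locale_uses_legacy_facet() {
    using facet_t = std::codecvt<char16_t, char, std::mbstate_t>;

    // The derived class shares the base facet's id, so it replaces the
    // standard facet in the new locale. The locale owns the pointer.
    const std::locale loc(std::locale::classic(), new legacy_codecvt);

    const facet_t& f = std::use_facet<facet_t>(loc);
    return std::has_facet<facet_t>(loc) &&
           f.encoding() == 1 &&
           f.max_length() == 1;
}
```

Code like this would stop compiling, or change meaning, if the specialization were no longer a required member of every locale.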

Our goals when discussing this issue will be to determine 1) whether we have a clear direction for a change, and 2) whether there is consensus for spending time addressing the issue.

LWG #3412: §[format.string.std] references to "Unicode encoding" unclear

This issue was reported by Hubert a couple of years ago and it bravely asks the question of what is meant by "Unicode encoding" in various parts of [format.string.std]. The Unicode standard specifies three Unicode encoding forms and seven Unicode encoding schemes. But what about UTF-7, UTF-EBCDIC, and GB18030? Do these count as Unicode encodings for the purposes of the C++ standard? The LWG issue does not provide a proposed resolution.

Our goals for this issue will be 1) to determine whether we have a clear understanding of the intent and consensus for a resolution direction, and 2) to identify someone willing to draft a proposed resolution.

Handling ill-formed Unicode in the library

The last agenda item comes from Mark's recent discussion on the SG16 mailing list where he boldly posits the existence of ill-formed Unicode text. Discussion determined that one of the examples in [format.string.escaped]p3 is incorrect; s5 should have a result value of ["\x{c3}("], not ["\x{c3}\x{28}"]. Further discussion appears to be needed to settle how width estimation should be performed when ill-formed Unicode text is present; which PR-121 policy should be used in these cases?
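
To show why the s5 result should be ["\x{c3}("], here is a simplified sketch of escaping only the ill-formed code units. It is not the library's algorithm: the function names are invented, quotes, backslashes, and non-printable characters are deliberately not handled, and well-formedness is checked against the byte ranges in Table 3-7 of the Unicode Standard.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes at that position do not begin one.
std::size_t well_formed_length(const std::string& s, std::size_t i) {
    // Bytes past the end are reported as 0x100, which matches no range.
    auto byte = [&s](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto in = [](unsigned b, unsigned lo, unsigned hi) { return b >= lo && b <= hi; };

    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (in(b0, 0xC2, 0xDF)) return in(byte(i + 1), 0x80, 0xBF) ? 2 : 0;
    if (in(b0, 0xE0, 0xEF)) {
        const unsigned lo = (b0 == 0xE0) ? 0xA0u : 0x80u;  // reject overlong forms
        const unsigned hi = (b0 == 0xED) ? 0x9Fu : 0xBFu;  // reject surrogates
        return (in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF)) ? 3 : 0;
    }
    if (in(b0, 0xF0, 0xF4)) {
        const unsigned lo = (b0 == 0xF0) ? 0x90u : 0x80u;  // reject overlong forms
        const unsigned hi = (b0 == 0xF4) ? 0x8Fu : 0xBFu;  // reject > U+10FFFF
        return (in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF) &&
                in(byte(i + 3), 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Hypothetical helper: copy well-formed sequences through unchanged and
// write every other code unit as \x{hh}.
std::string escape_ill_formed(const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = well_formed_length(s, i);
        if (n != 0) {
            out.append(s, i, n);
            i += n;
        } else {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            out += '}';
            ++i;
        }
    }
    return out;
}
```

For the input "\xc3\x28", 0xC3 is a lead byte that is not followed by a continuation byte, so only it is escaped; 0x28 is a perfectly good '(' and is copied through as-is.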

Our goals for this issue will be to determine 1) whether behavior should be well-defined in the face of ill-formed text, 2) what that behavior should be, and 3) how we should proceed with addressing the issue (LWG issue or paper; note that NB comment deadlines are rapidly approaching).

Tom.