ISOCPP sg16 List: Re: Agenda for the 2022-09-28 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 27 Sep 2022 13:26:59 -0400

This is your friendly reminder that this meeting is taking place tomorrow.

Tom.

On 9/25/22 7:12 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, September 28th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20220928T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda is:
>
>   * LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added
>     to locale <https://cplusplus.github.io/LWG/issue3767>
>   * LWG #3412: §[format.string.std] references to "Unicode encoding"
>     unclear <https://cplusplus.github.io/LWG/issue3412>
>   * Handling ill-formed Unicode in the library
>       o See prior mailing list discussion
>         <https://lists.isocpp.org/sg16/2022/09/3369.php>.
>
>
> LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to
> locale <https://cplusplus.github.io/LWG/issue3767>
>
> This issue was recently filed by Victor and once again poses a
> question familiar to all those who have discussed locale before: "But
> why?" Table [locale.category.facets]
> <http://eel.is/c++draft/locale.category#tab:locale.category.facets>
> includes the following codecvt specializations in the specified set of
> ctype category facets. A matching set of codecvt_byname
> specializations is likewise present in table [locale.spec]
> <http://eel.is/c++draft/locale.category#tab:locale.spec>.
>
> 1. codecvt<char, char, mbstate_t>
> 2. codecvt<char16_t, char8_t, mbstate_t> (Added via P0482
> <https://wg21.link/p0482>)
> 3. codecvt<char32_t, char8_t, mbstate_t> (Added via P0482
> <https://wg21.link/p0482>)
> 4. codecvt<wchar_t, char, mbstate_t>
>
> Those are not all of the required facets, though; the following are
> included via [depr.locale.category]
> <http://eel.is/c++draft/depr.locale.category>. They were deprecated,
> but not removed, by P0482 <https://wg21.link/p0482>.
>
> 5. codecvt<char16_t, char, mbstate_t>
> 6. codecvt<char32_t, char, mbstate_t>
>
> The interesting thing that Victor points out is that, even before
> P0482 (char8_t: A type for UTF-8 characters and strings)
> <https://wg21.link/p0482> and P1041 (Make char16_t/char32_t string
> literals be UTF-16/32) <https://wg21.link/p1041>, the char16_t and
> char32_t specializations were specified to convert between
> UTF-16/UTF-32 and UTF-8 and are therefore locale independent. So why
> were they included as locale facets? And if they have no reason to be
> included, then the char8_t specializations surely should not be either.
>
> The codecvt facets are only used (within the standard) by
> std::basic_filebuf (see [filebuf.general]p5
> <http://eel.is/c++draft/filebuf.general#5>) and std::filesystem::path
> (see [fs.path.construct]p6
> <http://eel.is/c++draft/fs.path.construct#6>). The former only uses
> the char-based specializations (to convert between its parameterized
> charT character type and char) and the latter only uses the wchar_t
> specialization. It is worth noting that, because of
> [locale.codecvt.virtuals]p4
> <http://eel.is/c++draft/locale.codecvt.virtuals#4>, std::basic_filebuf
> is unable to use the char16_t-based specialization (see also SG16
> issue #33 <https://github.com/sg16-unicode/sg16/issues/33>).
>
> One of the motivations stated for Victor's proposed resolution is to
> avoid the overhead of loading these facets. It would be helpful to
> understand 1) what the overhead cost is in practice (presumably enough
> for someone to have noticed it and for Victor to have reported it),
> and 2) whether implementors would actually change their implementations.
>
> codecvt specializations may be used as base classes of user-defined
> class types that perform some kind of specialized conversion. It is
> therefore possible for a std::locale object to be constructed such
> that the facet returned by, for example,
> use_facet<std::codecvt<char16_t, char, mbstate_t>>(loc), implements a
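> conversion between UTF-16 and a locale dependent encoding. Removing
> the noted specializations would technically be a breaking change
> because of the impact on has_facet and use_facet: for a locale that
> no longer holds one of these facets, has_facet could return false and
> use_facet could throw std::bad_cast.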
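
A minimal sketch of the scenario described in the paragraph above, assuming C++17 or later. The names Latin1Codecvt and decode_with are hypothetical and exist only for illustration. The derived facet inherits the locale::id of std::codecvt<char16_t, char, mbstate_t>, so installing it in a locale replaces what use_facet returns for that specialization. Note that this specialization is deprecated as of C++20, so a compiler may warn.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical facet: decodes external bytes as ISO-8859-1 rather than UTF-8.
// Only do_in is overridden; every other virtual keeps the base behavior.
class Latin1Codecvt : public std::codecvt<char16_t, char, std::mbstate_t> {
protected:
    result do_in(state_type&, const extern_type* from, const extern_type* from_end,
                 const extern_type*& from_next, intern_type* to, intern_type* to_end,
                 intern_type*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_end) return partial;
            // Each ISO-8859-1 byte maps to the UTF-16 code unit of the same value.
            *to_next++ = static_cast<char16_t>(static_cast<unsigned char>(*from_next++));
        }
        return ok;
    }
};

// Hypothetical helper: decodes bytes with whichever facet loc holds for
// codecvt<char16_t, char, mbstate_t>.
std::u16string decode_with(const std::locale& loc, const std::string& bytes) {
    using facet_t = std::codecvt<char16_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);
    std::mbstate_t state{};
    // One UTF-16 code unit per input byte is enough for both facets used here.
    std::u16string out(bytes.size(), u'\0');
    const char* from_next = nullptr;
    char16_t* to_next = nullptr;
    const auto r = cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
                          out.data(), out.data() + out.size(), to_next);
    if (r != std::codecvt_base::ok) throw std::runtime_error("conversion failed");
    assert(from_next == bytes.data() + bytes.size());  // ok implies all input consumed
    out.resize(static_cast<std::size_t>(to_next - out.data()));
    return out;
}
```

With std::locale::classic(), decode_with interprets the bytes as UTF-8. With std::locale(std::locale::classic(), new Latin1Codecvt), the same use_facet call returns the derived facet and the bytes are interpreted as ISO-8859-1, which is the kind of user code a removal would affect.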
>
> Our goals when discussing this issue will be to determine 1) whether
> we have a clear direction for a change, and 2) whether there is
> consensus for spending time addressing the issue.
>
>
> LWG #3412: §[format.string.std] references to "Unicode encoding"
> unclear <https://cplusplus.github.io/LWG/issue3412>
>
> This issue was reported by Hubert a couple of years ago and it bravely
> asks the question of what is meant by "Unicode encoding" in various
> parts of [format.string.std]
> <https://eel.is/c++draft/format.string.std>. The Unicode standard
> specifies three Unicode encoding forms and seven Unicode encoding
> schemes. But what about UTF-7, UTF-EBCDIC, and GB18030? Do these count
> as Unicode encodings for the purposes of the C++ standard? The LWG
> issue does not provide a proposed resolution.
>
> Our goals for this issue will be 1) to determine whether we have a
> clear understanding of the intent and consensus for a resolution
> direction, and 2) to identify someone willing to draft a proposed
> resolution.
>
>
> Handling ill-formed Unicode in the library
> <https://lists.isocpp.org/sg16/2022/09/3369.php>
>
> The last agenda item comes from Mark's recent discussion on the SG16
> mailing list <https://lists.isocpp.org/sg16/2022/09/3369.php> where he
> boldly posits the existence of ill-formed Unicode text. Discussion
> determined that one of the examples in [format.string.escaped]p3
> <https://eel.is/c++draft/format.string.escaped#3> is incorrect; s5
> should have a result value of ["\x{c3}("], not ["\x{c3}\x{28}"].
> Further discussion appears to be needed to settle how width estimation
> should be performed when ill-formed Unicode text is present; which
> PR-121 <http://unicode.org/review/pr-121.html> policy should be used
> in these cases?
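
A minimal sketch of why the s5 example should produce \x{c3}( rather than \x{c3}\x{28}, assuming C++17 or later. The function names well_formed_length and escape_ill_formed are hypothetical. The sketch only escapes code units that are not part of a well-formed UTF-8 sequence (per Table 3-7 of the Unicode Standard). It does not add the enclosing quotes, does not handle the other escapes in [format.string.escaped], and says nothing about width estimation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if there is none.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;  // excludes overlong forms
        if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;  // excludes overlong forms
        if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
        return 0;  // stray continuation byte or invalid lead byte
    }
    if (i + len > s.size()) return 0;  // truncated sequence
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k)
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    return len;
}

// Copies well-formed sequences unchanged and writes every other code unit as \x{hh}.
std::string escape_ill_formed(std::string_view s) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            assert(i + n <= s.size());
            out.append(s.substr(i, n));
            i += n;
        } else {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            out += '}';
            ++i;  // resume at the very next code unit
        }
    }
    return out;
}
```

For the input "\xc3\x28", 0xC3 is a lead byte whose required continuation byte is missing, so only 0xC3 is escaped. Scanning then resumes at 0x28, which is a well-formed one-byte sequence for "(" and is copied through unchanged.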
>
> Our goals for this issue will be to determine 1) whether behavior
> should be well-defined in the face of ill-formed text, 2) what that
> behavior should be, and 3) how we should proceed with addressing the
> issue (LWG issue or paper; note that NB comment deadlines are rapidly
> approaching).
>
> Tom.
>
>

Received on 2022-09-27 17:27:00