ISOCPP sg16 List: Re: Agenda for the 2022-09-28 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 26 Sep 2022 10:41:59 +0200

On Mon, Sep 26, 2022 at 1:12 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> SG16 will hold a telecon on Wednesday, September 28th, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20220928T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
> ).
>
> The agenda is:
>
> - LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to
> locale <https://cplusplus.github.io/LWG/issue3767>
> - LWG #3412: §[format.string.std] references to "Unicode encoding"
> unclear <https://cplusplus.github.io/LWG/issue3412>
> - Handling ill-formed Unicode in the library
>   - See prior mailing list discussion
>     <https://lists.isocpp.org/sg16/2022/09/3369.php>.
>
> LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
> <https://cplusplus.github.io/LWG/issue3767>
>
> This issue was recently filed by Victor and poses once again a question
> familiar to all those that have discussed locale before, "But why?". Table
> [locale.category.facets]
> <http://eel.is/c++draft/locale.category#tab:locale.category.facets>
> includes the following codecvt specializations in the specified set of
> ctype category facets. A matching set of codecvt_byname specializations
> is likewise present in table [locale.spec]
> <http://eel.is/c++draft/locale.category#tab:locale.spec>.
>
> 1. codecvt<char, char, mbstate_t>
> 2. codecvt<char16_t, char8_t, mbstate_t> (Added via P0482
> <https://wg21.link/p0482>)
> 3. codecvt<char32_t, char8_t, mbstate_t> (Added via P0482
> <https://wg21.link/p0482>)
> 4. codecvt<wchar_t, char, mbstate_t>
>
> That isn't all the required facets though; the following are included via
> [depr.locale.category] <http://eel.is/c++draft/depr.locale.category>;
> they were deprecated, but not removed, by P0482 <https://wg21.link/p0482>.
>
> 1. codecvt<char16_t, char, mbstate_t>
> 2. codecvt<char32_t, char, mbstate_t>
>
> The interesting thing that Victor points out is that, even before P0482
> (char8_t: A type for UTF-8 characters and strings)
> <https://wg21.link/p0482> and P1041 (Make char16_t/char32_t string
> literals be UTF-16/32) <https://wg21.link/p1041>, the char16_t and
> char32_t specializations were specified to convert between UTF-16/UTF-32
> and UTF-8 and are therefore locale independent. So why were they included
> as locale facets? And if they have no reason to be included, then the
> char8_t specializations surely should not be either.
>
> The codecvt facets are only used (within the standard) by
> std::basic_filebuf (see [filebuf.general]p5
> <http://eel.is/c++draft/filebuf.general#5>) and std::filesystem::path
> (see [fs.path.construct]p6 <http://eel.is/c++draft/fs.path.construct#6>).
> The former only uses the char-based specializations (to convert between
> its parameterized charT character type and char) and the latter only uses
> the wchar_t specialization. It is worth noting that, because of
> [locale.codecvt.virtuals]p4
> <http://eel.is/c++draft/locale.codecvt.virtuals#4>, std::basic_filebuf is
> unable to use the char16_t-based specialization (see also SG16 issue #33
> <https://github.com/sg16-unicode/sg16/issues/33>).
>
> One of the motivations stated for Victor's proposed resolution is to avoid
> the overhead of loading these facets. It would be helpful to understand 1)
> what the overhead cost is in practice (presumably enough for someone to
> have noticed it and for Victor to have reported it), and 2) whether
> implementors would actually change their implementations.
>
> codecvt specializations may be used as base classes of user-defined class
> types that perform some kind of specialized conversion. It is therefore
> possible for a std::locale object to be constructed such that the facet
> returned by, for example, use_facet<std::codecvt<char16_t, char,
> mbstate_t>>(loc), implements a conversion between UTF-16 and a locale
> dependent encoding. Removing the noted specializations would technically be
> a breaking change due to impact to has_facet and use_facet.
>
> Our goals when discussing this issue will be to determine 1) whether we
> have a clear direction for a change, and 2) whether there is consensus for
> spending time addressing the issue.
>
> LWG #3412: §[format.string.std] references to "Unicode encoding" unclear
> <https://cplusplus.github.io/LWG/issue3412>
>
> This issue was reported by Hubert a couple of years ago and it bravely
> asks the question of what is meant by "Unicode encoding" in various parts
> of [format.string.std] <https://eel.is/c++draft/format.string.std>. The
> Unicode standard specifies three Unicode encoding forms and seven Unicode
> encoding schemes. But what about UTF-7, UTF-EBCDIC, and GB18030? Do these
> count as Unicode encodings for the purposes of the C++ standard? The LWG
> issue does not provide a proposed resolution.
>
> Our goals for this issue will be 1) to determine whether we have a clear
> understanding of the intent and consensus for a resolution direction, and
> 2) to identify someone willing to draft a proposed resolution.
>
There are further issues here.
The width of a grapheme is independent of the encoding.
We are just not forcing implementations to decode. Is that what we want?
I don't think it is useful.
Most encodings cannot represent any of the wide code points, and the
wideness of code points in Shift-JIS can be derived without doing a full
decoding.

Suggested resolution:
For a string decoded to a sequence of Unicode code points, its width is the
sum of the estimated widths of the first code points in its extended
grapheme clusters.
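
To make that wording concrete, here is a minimal sketch in C++. Everything
in it is an assumption for illustration: the names (estimated_width,
extends_cluster, estimated_string_width) are hypothetical, the wide ranges
are only a subset of what I believe [format.string.std] lists as width 2,
and the cluster test is a gross simplification of UAX #29 that only knows
about a few extending code points.

```cpp
#include <cstddef>
#include <string_view>

struct Range { char32_t first, last; };

// Assumption: a partial list of width-2 ranges, not the full table.
constexpr Range wide_ranges[] = {
    {0x1100, 0x115F},   {0x2E80, 0x303E},   {0x3040, 0xA4CF},
    {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},   {0xFF00, 0xFF60},
    {0x1F300, 0x1F64F}, {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Estimated width of a single code point: 2 if listed above, else 1.
constexpr int estimated_width(char32_t cp) {
    for (const Range& r : wide_ranges)
        if (cp >= r.first && cp <= r.last) return 2;
    return 1;
}

// Simplified stand-in for extended grapheme cluster segmentation:
// combining diacritical marks, variation selectors and ZWJ are treated
// as continuing the current cluster. Real segmentation needs UAX #29.
constexpr bool extends_cluster(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ||
           (cp >= 0xFE00 && cp <= 0xFE0F) ||
           cp == 0x200D;
}

// Sum the estimated widths of the first code point of each cluster.
// The input is already decoded, so no encoding is involved here.
constexpr int estimated_string_width(std::u32string_view text) {
    int width = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // The first code point always starts a cluster.
        if (i == 0 || !extends_cluster(text[i]))
            width += estimated_width(text[i]);
    }
    return width;
}
```

With this sketch, U"e\u0301" (e followed by a combining acute accent) is one
cluster of width 1, and the only encoding-specific step left is the decoding
that produces the code point sequence.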

If the intent is for implementers to throw their hands in the air when the
encoding is not "a Unicode encoding", then surely we want to support
UTF-8/16/32 and that's it. UTF-EBCDIC isn't more important or special than
Shift-JIS, and there is no reason for one encoding to have privileged
handling over the other.

More generally, any encoding that can round trip through Unicode should
qualify as a Unicode encoding, but I don't think we have a definition of
that anywhere.
Unicode defines "Unicode encoding form" as:
    "A character encoding form that assigns each Unicode scalar value to a
    unique code unit sequence."
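
As an illustration of what that definition and "round trip" mean in
practice, here is a small sketch using UTF-16. The names (Utf16Units,
encode_utf16, decode_utf16) are hypothetical, and the functions assume their
input is a valid Unicode scalar value or a sequence produced by
encode_utf16; there is no error handling.

```cpp
#include <cstddef>

// One scalar value maps to either one or two UTF-16 code units.
struct Utf16Units {
    char16_t units[2];
    std::size_t size;
};

// Precondition: scalar is a Unicode scalar value
// (at most 0x10FFFF and not in the surrogate range 0xD800-0xDFFF).
constexpr Utf16Units encode_utf16(char32_t scalar) {
    if (scalar < 0x10000)
        return {{static_cast<char16_t>(scalar), 0}, 1};
    const char32_t v = scalar - 0x10000;
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},
            2};
}

// Precondition: u was produced by encode_utf16.
constexpr char32_t decode_utf16(const Utf16Units& u) {
    if (u.size == 1)
        return static_cast<char32_t>(u.units[0]);
    const char32_t high = static_cast<char32_t>(u.units[0]) - 0xD800;
    const char32_t low = static_cast<char32_t>(u.units[1]) - 0xDC00;
    return 0x10000 + (high << 10) + low;
}

// U+1F600 is encoded as the surrogate pair D83D DE00 and decodes back.
static_assert(encode_utf16(U'\U0001F600').size == 2);
static_assert(decode_utf16(encode_utf16(U'\U0001F600')) == U'\U0001F600');
```

An encoding such as Shift-JIS cannot pass this kind of test for every scalar
value, which is why a definition based on round tripping would need to say
which direction and which repertoire it is talking about.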

> Handling ill-formed Unicode in the library
> <https://lists.isocpp.org/sg16/2022/09/3369.php>
>
> The last agenda item comes from Mark's recent discussion on the SG16
> mailing list <https://lists.isocpp.org/sg16/2022/09/3369.php> where he
> boldly posits the existence of ill-formed Unicode text. Discussion
> determined that one of the examples in [format.string.escaped]p3
> <https://eel.is/c++draft/format.string.escaped#3> is incorrect; s5 should
> have a result value of ["\x{c3}("], not ["\x{c3}\x{28}"]. Further
> discussion appears to be needed to settle how width estimation should be
> performed when ill-formed Unicode text is present; which PR-121
> <http://unicode.org/review/pr-121.html> policy should be used in these
> cases?
>
> Our goals for this issue will be to determine 1) whether behavior should
> be well-defined in the face of ill-formed text, 2) what that behavior
> should be, and 3) how we should proceed with addressing the issue (LWG
> issue or paper; note that NB comment deadlines are rapidly approaching).
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-09-26 08:42:13