ISOCPP sg16 List: Agenda for the 2022-09-28 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 25 Sep 2022 19:12:09 -0400

SG16 will hold a telecon on Wednesday, September 28th, at 19:30 UTC
(timezone conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20220928T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

The agenda is:

  * LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to
    locale <https://cplusplus.github.io/LWG/issue3767>
  * LWG #3412: §[format.string.std] references to "Unicode encoding"
    unclear <https://cplusplus.github.io/LWG/issue3412>
  * Handling ill-formed Unicode in the library
      o See prior mailing list discussion
        <https://lists.isocpp.org/sg16/2022/09/3369.php>.

  LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to
  locale <https://cplusplus.github.io/LWG/issue3767>

This issue was recently filed by Victor and once again poses a question
familiar to all those who have discussed locale before: "But why?".
Table [locale.category.facets]
<http://eel.is/c++draft/locale.category#tab:locale.category.facets>
includes the following codecvt specializations in the specified set of
ctype category facets. A matching set of codecvt_byname specializations
is likewise present in table [locale.spec]
<http://eel.is/c++draft/locale.category#tab:locale.spec>.

1. codecvt<char, char, mbstate_t>
2. codecvt<char16_t, char8_t, mbstate_t> (Added via P0482
    <https://wg21.link/p0482>)
3. codecvt<char32_t, char8_t, mbstate_t> (Added via P0482
    <https://wg21.link/p0482>)
4. codecvt<wchar_t, char, mbstate_t>
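
As a quick illustration (a minimal sketch, not part of the issue text),
the presence of facets 1 and 4 in a locale can be observed with
has_facet. The helper name below is made up for the example, and the
char8_t-based specializations are left out so that the snippet does not
depend on C++20 library support.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: reports whether the classic "C" locale carries the
// two char-based codecvt facets listed above (items 1 and 4).
bool classic_has_required_codecvt_facets() {
    const std::locale& loc = std::locale::classic();
    return std::has_facet<std::codecvt<char, char, std::mbstate_t>>(loc) &&
           std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
}
```

The same query with the char16_t or char32_t specializations is what
would be affected if those facets were no longer required.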

Those aren't all of the required facets though; the following are
included via [depr.locale.category]
<http://eel.is/c++draft/depr.locale.category>; they were deprecated, but
not removed, by P0482 <https://wg21.link/p0482>.

5. codecvt<char16_t, char, mbstate_t>
6. codecvt<char32_t, char, mbstate_t>

The interesting thing that Victor points out is that, even before P0482
(char8_t: A type for UTF-8 characters and strings)
<https://wg21.link/p0482> and P1041 (Make char16_t/char32_t string
literals be UTF-16/32) <https://wg21.link/p1041>, the char16_t and
char32_t specializations were specified to convert between UTF-16/UTF-32
and UTF-8 and are therefore locale independent. So why were they
included as locale facets? And if they have no reason to be included,
then the char8_t specializations surely should not be either.

The codecvt facets are only used (within the standard) by
std::basic_filebuf (see [filebuf.general]p5
<http://eel.is/c++draft/filebuf.general#5>) and std::filesystem::path
(see [fs.path.construct]p6
<http://eel.is/c++draft/fs.path.construct#6>). The former only uses the
char-based specializations (to convert between its parameterized charT
character type and char) and the latter only uses the wchar_t
specialization. It is worth noting that, because of
[locale.codecvt.virtuals]p4
<http://eel.is/c++draft/locale.codecvt.virtuals#4>, std::basic_filebuf
is unable to use the char16_t-based specialization (see also SG16 issue
#33 <https://github.com/sg16-unicode/sg16/issues/33>).

One of the motivations stated for Victor's proposed resolution is to
avoid the overhead of loading these facets. It would be helpful to
understand 1) what the overhead cost is in practice (presumably enough
for someone to have noticed it and for Victor to have reported it), and
2) whether implementors would actually change their implementations.

codecvt specializations may be used as base classes of user-defined
class types that perform some kind of specialized conversion. It is
therefore possible for a std::locale object to be constructed such that
the facet returned by, for example, use_facet<std::codecvt<char16_t,
char, mbstate_t>>(loc), implements a conversion between UTF-16 and a
locale dependent encoding. Removing the noted specializations would
technically be a breaking change due to the impact on has_facet and
use_facet.
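
To make that concrete, here is a rough sketch of such a derived facet.
The latin1_codecvt and decode_with names are hypothetical, and the
wchar_t specialization is used instead of the deprecated char16_t one
only to keep the example free of deprecation warnings; the same
mechanism would apply to the specializations discussed above.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical user-defined facet: treats external bytes as ISO-8859-1.
// It is installed under the id of its base class specialization.
struct latin1_codecvt : std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        // Each Latin-1 byte maps to the code point with the same value.
        for (; from != from_end && to != to_end; ++from, ++to)
            *to = static_cast<wchar_t>(static_cast<unsigned char>(*from));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
};

// Decodes bytes using whatever codecvt<wchar_t, char, mbstate_t> facet
// the given locale holds; the caller never names the derived type.
std::wstring decode_with(const std::locale& loc, const std::string& bytes) {
    using facet_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);
    std::wstring out(bytes.size(), L'\0');
    std::mbstate_t state{};
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A locale built as std::locale(std::locale::classic(), new latin1_codecvt)
then hands the derived facet to any code that asks for the base
specialization via use_facet, which is the kind of usage a removal
could break.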

Our goals when discussing this issue will be to determine 1) whether we
have a clear direction for a change, and 2) whether there is consensus
for spending time addressing the issue.

  LWG #3412: §[format.string.std] references to "Unicode encoding"
  unclear <https://cplusplus.github.io/LWG/issue3412>

This issue was reported by Hubert a couple of years ago and it bravely
asks the question of what is meant by "Unicode encoding" in various
parts of [format.string.std]
<https://eel.is/c++draft/format.string.std>. The Unicode standard
specifies three Unicode encoding forms and seven Unicode encoding
schemes. But what about UTF-7, UTF-EBCDIC, and GB18030? Do these count
as Unicode encodings for the purposes of the C++ standard? The LWG issue
does not provide a proposed resolution.

Our goals for this issue will be 1) to determine whether we have a clear
understanding of the intent and consensus for a resolution direction,
and 2) to identify someone willing to draft a proposed resolution.

  Handling ill-formed Unicode in the library
  <https://lists.isocpp.org/sg16/2022/09/3369.php>

The last agenda item comes from Mark's recent discussion on the SG16
mailing list <https://lists.isocpp.org/sg16/2022/09/3369.php> where he
boldly posits the existence of ill-formed Unicode text. Discussion
determined that one of the examples in [format.string.escaped]p3
<https://eel.is/c++draft/format.string.escaped#3> is incorrect; s5
should have a result value of ["\x{c3}("], not ["\x{c3}\x{28}"]. Further
discussion appears to be needed to settle how width estimation should be
performed when ill-formed Unicode text is present; which PR-121
<http://unicode.org/review/pr-121.html> policy should be used in these
cases?
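
For reference, the s5 correction can be reproduced with a small sketch
that escapes only code units that do not begin a well-formed UTF-8
sequence. This is an illustration under assumptions, not the
std::format implementation: the function names are hypothetical, the
well-formedness check follows the byte ranges of Table 3-7 in the
Unicode Standard, and none of the other escaping rules in
[format.string.escaped] are handled.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Returns the length (1-4) of the well-formed UTF-8 sequence starting at
// s[i], or 0 if no well-formed sequence starts there.
std::size_t well_formed_length(const std::string& s, std::size_t i) {
    // Out-of-range reads yield 0x100, which matches no byte range below.
    auto at = [&](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto in = [](unsigned b, unsigned lo, unsigned hi) {
        return lo <= b && b <= hi;
    };
    const unsigned b0 = at(i);
    if (b0 <= 0x7F) return 1;
    if (in(b0, 0xC2, 0xDF)) return in(at(i + 1), 0x80, 0xBF) ? 2 : 0;
    if (in(b0, 0xE0, 0xEF)) {
        const unsigned lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // no overlongs
        const unsigned hi = (b0 == 0xED) ? 0x9F : 0xBF;  // no surrogates
        return in(at(i + 1), lo, hi) && in(at(i + 2), 0x80, 0xBF) ? 3 : 0;
    }
    if (in(b0, 0xF0, 0xF4)) {
        const unsigned lo = (b0 == 0xF0) ? 0x90 : 0x80;  // no overlongs
        const unsigned hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // <= U+10FFFF
        return in(at(i + 1), lo, hi) && in(at(i + 2), 0x80, 0xBF) &&
               in(at(i + 3), 0x80, 0xBF) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Copies well-formed sequences unchanged and writes every other code
// unit as \x{hh}, one escape per code unit.
std::string escape_ill_formed(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            out.append(s, i, n);
            i += n;
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%02x}",
                          static_cast<unsigned>(
                              static_cast<unsigned char>(s[i])));
            out += buf;
            ++i;
        }
    }
    return out;
}
```

With the input "\xc3\x28", the lead byte 0xC3 is not followed by a
continuation byte and is escaped on its own, while 0x28 is an ordinary
'(' and is copied through, giving \x{c3}( as in the corrected example.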

Our goals for this issue will be to determine 1) whether behavior should
be well-defined in the face of ill-formed text, 2) what that behavior
should be, and 3) how we should proceed with addressing the issue (LWG
issue or paper; note that NB comment deadlines are rapidly approaching).

Tom.

Received on 2022-09-25 23:12:12