ISOCPP sg16 List: Re: Agenda for the 2023-10-25 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 25 Oct 2023 21:26:14 +0200

On Wed, Oct 25, 2023 at 4:56 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> On 10/24/23 1:11 AM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, October 25th, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20231025T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>
> ).
>
> The agenda follows.
>
> - charN_t, char_traits, codecvt, and iostreams:
>   - P2873R0: Remove Deprecated Locale Category Facets For Unicode
>     from C++26 <https://wg21.link/p2873r0>
>   - LWG 3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added
>     to locale <https://wg21.link/lwg3767>
>   - LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit
>     <https://wg21.link/lwg2959>
>   - SG16 #32: std::char_traits<char16_t>::eof() requires
>     uint_least16_t to be larger than 16 bits
>     <https://github.com/sg16-unicode/sg16/issues/32>
>   - SG16 #33: A correct codecvt facet that works with
>     basic_filebuf can't do UTF conversions
>     <https://github.com/sg16-unicode/sg16/issues/33>
>
> Hang on, this is going to be a bumpy ride.
>
> When char16_t and char32_t were added for C++11, the standard library was
> extended to support corresponding specializations of std::char_traits (
> [char.traits.general]p1 <http://eel.is/c++draft/char.traits.general#1>)
> and std::basic_string ([string.classes.general]p1
> <http://eel.is/c++draft/string.classes#general-1>). Curiously, type
> aliases were added for specializations of the std::fpos ([iosfwd.syn]
> <http://eel.is/c++draft/iosfwd.syn#lib:fpos>) class template (but only in
> the synopsis) and support for these types was added for the std::codecvt (
> [tab:locale.category.facets]
> <http://eel.is/c++draft/locale.category#tab:locale.category.facets>) and
> std::codecvt_byname ([tab:locale.spec]
> <http://eel.is/c++draft/locale.category#tab:locale.spec>) locale facets,
> but not for any of the other locale facets nor for iostreams in general.
> Support for these types was added to std::basic_string_view (
> [string.view.synop] <http://eel.is/c++draft/string.view.synop>) and
> std::filesystem::path ([fs.path.type.cvt]p2
> <http://eel.is/c++draft/fs.path.type.cvt#2>) in C++17, but no additional
> support was ever extended to iostreams. The status quo is thus that the
> standard requires implementations to provide some fragments (std::fpos,
> std::codecvt, and std::codecvt_byname) of iostream support for these
> types despite there being no use of these type aliases and specializations
> in the standard; implementations are not required to support streams of
> char16_t or char32_t.
>
> std::char_traits is used by both the string library (e.g.,
> std::basic_string) and iostreams. However, the string library only
> depends on some of the std::char_traits members; it does not make use of
> the int_type member type alias nor any of the member functions that
> depend on that type (eof(), not_eof(), to_char_type(), to_int_type(),
> eq_int_type()). Per LWG 2959 <https://wg21.link/lwg2959> and SG16 #32
> <https://github.com/sg16-unicode/sg16/issues/32>, the specified
> std::char_traits<char16_t> specialization has a defect; all char16_t
> values are valid code unit values, but the int_type member type alias is
> defined as uint_least16_t (the same underlying type as char16_t) and it
> is thus unable to hold a distinct value for EOF. The obvious fix is to use
> a larger type for int_type, but that would result in an ABI break. I
> recently asked the ABI review group if there are any known tricks they
> could deploy to mitigate an ABI break, but no direct solutions were
> identified; a suggestion to provide an alternative type for
> std::char_traits<char16_t> that programmers would have to explicitly use
> instead of the broken specialization was offered. That is an option, but
> since the problematic int_type member is not actually used by any
> functionality the standard requires implementors to provide, an ABI break
> in this case might have little practical consequence.
>
>
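
To make the int_type problem concrete, here is a minimal sketch. The helper names has_room_for_eof and eof16_is_a_code_unit_value are hypothetical, not standard facilities, and the sketch assumes a platform where uint_least16_t is exactly 16 bits wide (the common case).

```cpp
#include <limits>
#include <string>

// Hypothetical helper (not a standard facility): reports whether
// Traits::int_type has more value bits than Traits::char_type, i.e. whether
// a value outside the range of char_type is available to represent EOF.
template <class Traits>
constexpr bool has_room_for_eof() {
    using char_type = typename Traits::char_type;
    using int_type = typename Traits::int_type;
    return std::numeric_limits<int_type>::digits >
           std::numeric_limits<char_type>::digits;
}

// Hypothetical helper: true when eof() survives a round trip through
// char16_t unchanged, which means eof() has the same value as some
// char16_t code unit and cannot be told apart from it by value alone.
constexpr bool eof16_is_a_code_unit_value() {
    using traits = std::char_traits<char16_t>;
    return static_cast<traits::int_type>(
               static_cast<char16_t>(traits::eof())) == traits::eof();
}
```

Under the stated assumption, has_room_for_eof reports true for std::char_traits<char> (int is wider than char) and false for std::char_traits<char16_t>, which is the gap that LWG 2959 describes.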

> When char8_t was added for C++20 via P0482R6 (char8_t: A type for UTF-8
> characters and strings) <https://wg21.link/p0482>, I failed to understand
> the intended purpose for which std::codecvt was added to the standard. My
> impression of it at the time was that it was a poorly designed general
> transcoding facility; I failed to appreciate its significance as a locale
> facet as used by iostreams. This resulted in two mistakes:
>
> 1. I deprecated the following specializations (and their use as locale
> category facets):
> std::codecvt<char16_t, char, std::mbstate_t>
> std::codecvt<char32_t, char, std::mbstate_t>
> std::codecvt_byname<char16_t, char, std::mbstate_t>
> std::codecvt_byname<char32_t, char, std::mbstate_t>
> 2. I added the following specializations as required locale category
> facets (adding the specializations themselves is arguably not a mistake,
> but adding them as locale category facets is):
> std::codecvt<char16_t, char8_t, std::mbstate_t>
> std::codecvt<char32_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char16_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char32_t, char8_t, std::mbstate_t>
>
> Note that std::codecvt facets are only used by std::basic_filebuf which
> only ever converts to and from elements of type char; the facets that
> convert to and from char8_t are not substitutable for that purpose.
>
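
For readers less familiar with how a file buffer uses the facet, the following is a rough sketch of the internal-to-external direction for the wchar_t case. The function name to_external is hypothetical, the classic "C" locale is an assumption chosen so the example does not depend on the environment, and a real std::basic_filebuf does considerably more (buffering, partial conversions, unshift).

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts "internal" wchar_t characters to "external"
// char elements through the locale's codecvt facet, which is the only kind
// of conversion (to and from char) that a file buffer asks of the facet.
std::string to_external(const std::wstring& internal,
                        const std::locale& loc = std::locale::classic()) {
    using facet_type = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::mbstate_t state{};
    // max_length() bounds the number of external chars per internal char.
    std::string external(internal.size() * cvt.max_length(), '\0');

    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.out(state, internal.data(), internal.data() + internal.size(),
                from_next, &external[0], &external[0] + external.size(),
                to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("codecvt conversion failed");
    }
    external.resize(static_cast<std::size_t>(to_next - &external[0]));
    return external;
}
```

The external type in this sketch is char, as it is for every conversion performed by a file buffer; a facet whose external type is char8_t cannot be dropped into this role.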
> P2873R0 <https://wg21.link/p2873r0>, which SG16 already approved (or,
> rather, did not object to) during the 2023-05-26 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#may-24th-2023>, now seeks
> to remove the deprecated specializations. LWG 3767
> <https://wg21.link/lwg3767> tracks addressing the incorrect addition of
> the char8_t specializations as locale facets.
>
> Arguably, P0482R6 <https://wg21.link/p0482> should have added the
> following specializations as locale facets:
>
> - std::codecvt<char8_t, char, std::mbstate_t>
> - std::codecvt_byname<char8_t, char, std::mbstate_t>
>
> The only specification for std::codecvt_byname in the standard is the
> synopsis in [locale.codecvt.byname]
> <http://eel.is/c++draft/locale.codecvt.byname>; there is no other wording
> present.
>
> As mentioned, the standard does not require implementations to provide
> iostream support for the charN_t types. However, implementations may do
> so as an extension. If they do, then, per [filebuf.general]p7
> <http://eel.is/c++draft/input.output#filebuf.general-7>, specializations
> of std::codecvt<charN_t, char, std::mbstate_t> are required to be
> available via a call to std::use_facet() for the imbued locale. In that
> case, per the standard, the status of the necessary specializations is:
>
> - std::codecvt<char8_t, char, std::mbstate_t> # Not specified.
> - std::codecvt<char16_t, char, std::mbstate_t> # Deprecated.
> - std::codecvt<char32_t, char, std::mbstate_t> # Deprecated.
>
> If it is desirable to provide a better foundation for iostream support of
> the charN_t types, either for a future version of the standard, or for
> implementations that want to provide such support as an extension, we could
> undeprecate the previously deprecated specializations and add the missing
> one for char8_t. Since iostreams does not support charN_t in the standard
> today and since the char16_t and char32_t specializations have already
> been deprecated for two release cycles, perhaps it is even reasonable to
> change their behavior so that they convert to and from the locale encoding
> rather than UTF-8. This would remove the existing inconsistency with the
> corresponding char and wchar_t specializations that was part of the
> motivation for their deprecation in the first place (see the discussion of
> codecvt in the Motivation section of P0482R6
> <https://wg21.link/p0482r6#motivation>).
>
> However, an endeavor to improve the situation for iostreams and charN_t next
> runs into SG16 #33 <https://github.com/sg16-unicode/sg16/issues/33>;
> std::basic_fstream does not support the UTF-8 and UTF-16 encodings for
> the "internal" side of a std::codecvt conversion because
> std::basic_filebuf requires that, per [locale.codecvt.virtuals]p4
> <http://eel.is/c++draft/locale.codecvt#virtuals-4> and its related
> footnote <http://eel.is/c++draft/locale.codecvt#footnote-246>, "internal"
> characters are mapped 1-N to "external" characters. This is an existing
> issue for std::basic_fstream<wchar_t> with UTF-16 data.
>
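
A small sketch of why UTF-8 and UTF-16 do not fit a 1-N mapping on the "internal" side. The helper names utf16_units and utf8_bytes are hypothetical, and both assume the argument is a valid Unicode scalar value.

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-16 code units needed for one scalar
// value (assumes cp is a valid Unicode scalar value).
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Hypothetical helper: number of UTF-8 code units needed for one scalar
// value (assumes cp is a valid Unicode scalar value).
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

For U+1F600, two internal UTF-16 code units jointly correspond to four external UTF-8 bytes. Neither surrogate maps to an external sequence on its own, so the conversion is N-M rather than 1-N.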
> The Microsoft and libstdc++ standard library implementations appear to
> support iostreams with charN_t types; at least on the surface. Libc++
> intentionally does not provide definitions for charN_t specializations of
> locale facets that are not required by the standard; as a result, even
> basic usage provokes compilation errors. I have not yet investigated to
> what extent the Microsoft and libstdc++ implementations work as might be
> expected. My impression is that, where they do produce expected results, it
> is serendipity at work. See https://godbolt.org/z/6T7hebY33 for a bit of
> fun (testing on Windows requires changes to use an actual zero valued file
> since Windows doesn't provide a builtin analog for /dev/zero, but in that
> case, MSVC produces an executable that behaves as might be expected).
>
> I haven't looked hard, but I have not yet identified any code in the wild
> that uses iostreams with charN_t types. One would think that, if any
> project did, it would be ICU. I confirmed that ICU, despite its use of
> char16_t, makes no attempt to use it with iostreams.
>
> So where is this all going? I see three general options that can be
> pursued to resolve these various issues.
>
> 1. We can fix these issues, despite the acknowledged ABI impact, so
> that the standard no longer actively hinders support for iostreams with the
> charN_t types. Optionally, we could further explore requiring such
> support in the standard (doing so would require adding charN_t support
> to more locale facets).
>
> With respect to support for iostreams with charN_t types requiring added
> support for more locale facets, please note that extending support for
> std::format() to charN_t types would presumably also require adding
> support to most, if not all, of the same locale facets.
>

Well, there is no design for that yet, and there are many levels to it.
If we only want to print charN_t (which is a high-priority item; that we
can't output Unicode is more than sad), we only need to specify a
conversion, for which facets are not needed.
If we wanted to support all the formatting that Unicode allows, std::locale
and the accompanying facets are wholly insufficient.
So if we need to specify conversion to/from the execution encoding, for
iostreams or format, we could do so in a way that does not depend on
codecvt or facets.
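
As an illustration of a conversion that involves no facets, here is a minimal sketch. It assumes the target encoding is UTF-8 (the execution encoding is not necessarily UTF-8), the function names are hypothetical, and replacing unpaired surrogates with U+FFFD is just one possible error policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one scalar value.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: transcodes UTF-16 to UTF-8 without std::locale or
// std::codecvt. Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one scalar value.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The result could then be written through an ordinary char stream, for example std::cout << utf16_to_utf8(u"text"), provided the terminal or file is expected to hold UTF-8.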

>
>
> 2. We can declare that iostreams will never support the charN_t types
> in the standard and deprecate and remove the fragments of such support that
> are present. Implementations could of course provide support as an
> extension if they so desire.
> 3. We can admit things are broken, choose to do nothing about it, and
> close the related LWG issues while chanting sorry-not-sorry.
>
> The above issues are sufficiently complicated that I believe a paper is
> warranted regardless of the direction that we favor. I'm signing up to
> write that paper since I'm responsible for some of the mess. I do not
> intend to poll any directions in this meeting; rather, the focus is to
> ensure that the issues are well understood, to discuss decisions we could
> make and their potential consequences, and to generally collect information
> that will lead to a better paper.
>
> Responses provided before the meeting to identify other existing related
> issues or considerations would be appreciated. Ideal responses do not
> include the phrase "burn it all to the ground".
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2023-10-25 19:26:34