Date: Wed, 25 Oct 2023 10:56:50 -0400
On 10/24/23 1:11 AM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, October 25th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20231025T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).
>
> The agenda follows.
>
> * charN_t, char_traits, codecvt, and iostreams:
> o P2873R0: Remove Deprecated Locale Category Facets For Unicode
> from C++26 <https://wg21.link/p2873r0>
> o LWG 3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly
> added to locale <https://wg21.link/lwg3767>
> o LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code
> unit <https://wg21.link/lwg2959>
> + SG16 #32: std::char_traits<char16_t>::eof() requires
> uint_least16_t to be larger than 16 bits
> <https://github.com/sg16-unicode/sg16/issues/32>
> o SG16 #33: A correct codecvt facet that works with
> basic_filebuf can't do UTF conversions
> <https://github.com/sg16-unicode/sg16/issues/33>
>
> Hang on, this is going to be a bumpy ride.
>
> When char16_t and char32_t were added for C++11, the standard library
> was extended to support corresponding specializations of
> std::char_traits ([char.traits.general]p1
> <http://eel.is/c++draft/char.traits.general#1>) and std::basic_string
> ([string.classes.general]p1
> <http://eel.is/c++draft/string.classes#general-1>). Curiously, type
> aliases were added for specializations of the std::fpos ([iosfwd.syn]
> <http://eel.is/c++draft/iosfwd.syn#lib:fpos>) class template (but only
> in the synopsis) and support for these types was added for the
> std::codecvt ([tab:locale.category.facets]
> <http://eel.is/c++draft/locale.category#tab:locale.category.facets>)
> and std::codecvt_byname ([tab:locale.spec]
> <http://eel.is/c++draft/locale.category#tab:locale.spec>) locale
> facets, but not for any of the other locale facets nor for iostreams
> in general. Support for these types was added to
> std::basic_string_view ([string.view.synop]
> <http://eel.is/c++draft/string.view.synop>) and std::filesystem::path
> ([fs.path.type.cvt]p2 <http://eel.is/c++draft/fs.path.type.cvt#2>) in
> C++17, but no additional support was ever extended to iostreams. The
> status quo is thus that the standard requires implementations to
> provide some fragments (std::fpos, std::codecvt, and
> std::codecvt_byname) of iostream support for these types despite there
> being no use of these type aliases and specializations in the
> standard; implementations are not required to support streams of
> char16_t or char32_t.
>
> std::char_traits is used by both the string library (e.g.,
> std::basic_string) and iostreams. However, the string library only
> depends on some of the std::char_traits members; it does not make use
> of the int_type member type alias nor any of the member functions that
> depend on that type (eof(), not_eof(), to_char_type(),
> to_int_type(), eq_int_type()). Per LWG 2959
> <https://wg21.link/lwg2959> and SG16 #32
> <https://github.com/sg16-unicode/sg16/issues/32>, the specified
> std::char_traits<char16_t> specialization has a defect; all char16_t
> values are valid code unit values, but the int_type member type alias
> is defined as uint_least16_t (the same underlying type as char16_t)
> and it is thus unable to hold a distinct value for EOF. The obvious
> fix is to use a larger type for int_type, but that would result in an
> ABI break. I recently asked the ABI review group if there are any
> known tricks they could deploy to mitigate an ABI break, but no direct
> solutions were identified; a suggestion to provide an alternative type
> for std::char_traits<char16_t> that programmers would have to
> explicitly use instead of the broken specialization was offered. That
> is an option, but since the problematic int_type member is not
> actually used by any functionality the standard requires implementors
> to provide, an ABI break in this case might have little practical
> consequence.
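
The int_type defect described above can be probed with a minimal sketch like the following. It is an illustration only and is not taken from LWG 2959 or SG16 #32; the names int_type_value_bits and code_units_survive_int_type are hypothetical. It checks whether every code unit converts to an int_type value that is distinct from eof() and converts back unchanged. With a 16-bit int_type, both cannot hold for all 65536 char16_t values, so an implementation must either let a code unit collide with eof() or remap it to another value.

```cpp
#include <limits>
#include <string>

// Number of value bits in char_traits<CharT>::int_type.
template <class CharT>
constexpr int int_type_value_bits() {
    return std::numeric_limits<
        typename std::char_traits<CharT>::int_type>::digits;
}

// Returns true if every code unit in [0, max_code_unit] converts to an
// int_type value that is distinct from eof() and converts back to the
// same code unit. (Hypothetical helper, for illustration only.)
template <class CharT, class Traits = std::char_traits<CharT>>
bool code_units_survive_int_type(unsigned long max_code_unit) {
    for (unsigned long v = 0; v <= max_code_unit; ++v) {
        const CharT c = static_cast<CharT>(v);
        const typename Traits::int_type i = Traits::to_int_type(c);
        if (Traits::eq_int_type(i, Traits::eof())) {
            return false;  // a valid code unit is indistinguishable from EOF
        }
        if (!Traits::eq(Traits::to_char_type(i), c)) {
            return false;  // the code unit was remapped to a different value
        }
    }
    return true;
}
```

Calling code_units_survive_int_type<char>(0xFF) is expected to return true because int is wider than char, while code_units_survive_int_type<char16_t>(0xFFFF) cannot return true when int_type_value_bits<char16_t>() is 16.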
>
> When char8_t was added for C++20 via P0482R6 (char8_t: A type for
> UTF-8 characters and strings) <https://wg21.link/p0482>, I failed to
> understand the intended purpose for which std::codecvt was added to
> the standard. My impression of it at the time was that it was a poorly
> designed general transcoding facility; I failed to appreciate its
> significance as a locale facet as used by iostreams. This resulted in
> two mistakes:
>
> 1. I deprecated the following specializations (and their use as
> locale category facets):
> std::codecvt<char16_t, char, std::mbstate_t>
> std::codecvt<char32_t, char, std::mbstate_t>
> std::codecvt_byname<char16_t, char, std::mbstate_t>
> std::codecvt_byname<char32_t, char, std::mbstate_t>
> 2. I added the following specializations as required locale category
> facets (adding the specializations themselves is arguably not a
> mistake, but adding them as locale category facets is):
> std::codecvt<char16_t, char8_t, std::mbstate_t>
> std::codecvt<char32_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char16_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char32_t, char8_t, std::mbstate_t>
>
> Note that std::codecvt facets are only used by std::basic_filebuf
> which only ever converts to and from elements of type char; the facets
> that convert to and from char8_t are not substitutable for that purpose.
>
> P2873R0 <https://wg21.link/p2873r0>, which SG16 already approved (or,
> rather, did not object to) during the 2023-05-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#may-24th-2023>, now
> seeks to remove the deprecated specializations. LWG 3767
> <https://wg21.link/lwg3767> tracks addressing the incorrect addition
> of the char8_t specializations as locale facets.
>
> Arguably, P0482R6 <https://wg21.link/p0482> should have added the
> following specializations as locale facets:
>
> * std::codecvt<char8_t, char, std::mbstate_t>
> * std::codecvt_byname<char8_t, char, std::mbstate_t>
>
> The only specification for std::codecvt_byname in the standard is the
> synopsis in [locale.codecvt.byname]
> <http://eel.is/c++draft/locale.codecvt.byname>; there is no other
> wording present.
>
> As mentioned, the standard does not require implementations to provide
> iostream support for the charN_t types. However, implementations may
> do so as an extension. If they do, then, per [filebuf.general]p7
> <http://eel.is/c++draft/input.output#filebuf.general-7>,
> specializations of std::codecvt<charN_t, char, std::mbstate_t> are
> required to be available via a call to std::use_facet() for the imbued
> locale. In that case, per the standard, the status of the necessary
> specializations is:
>
> * std::codecvt<char8_t, char, std::mbstate_t> # Not specified.
> * std::codecvt<char16_t, char, std::mbstate_t> # Deprecated.
> * std::codecvt<char32_t, char, std::mbstate_t> # Deprecated.
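
For comparison, the following minimal sketch shows the facet lookup that the quoted text refers to, using the wchar_t specialization because the charN_t ones are deprecated or unspecified. The function names has_filebuf_codecvt_facets and widen_with_codecvt are hypothetical, and the sketch only approximates what a file buffer does on input; it is not any implementation's actual code.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Checks that the locale provides the codecvt facets that file buffers
// for char and wchar_t rely on.
bool has_filebuf_codecvt_facets(const std::locale& loc) {
    return std::has_facet<std::codecvt<char, char, std::mbstate_t>>(loc) &&
           std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
}

// Converts external char elements to internal wchar_t elements with the
// locale's codecvt facet. Returns an empty string if conversion fails.
std::wstring widen_with_codecvt(const std::string& bytes,
                                const std::locale& loc) {
    using facet_type = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::mbstate_t state{};
    std::wstring out(bytes.size(), L'\0');  // at most one wchar_t per byte
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result res =
        cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (res != std::codecvt_base::ok) {
        return std::wstring();
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A stream of charN_t would need the same lookup to succeed for std::codecvt<charN_t, char, std::mbstate_t>, which is why the status of those three specializations matters.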
>
> If it is desirable to provide a better foundation for iostream support
> of the charN_t types, either for a future version of the standard, or
> for implementations that want to provide such support as an extension,
> we could undeprecate the previously deprecated specializations and add
> the missing one for char8_t. Since iostreams does not support charN_t
> in the standard today and since the char16_t and char32_t
> specializations have already been deprecated for two release cycles,
> perhaps it is even reasonable to change their behavior so that they
> convert to and from the locale encoding rather than UTF-8. This would
> remove the existing inconsistency with the corresponding char and
> wchar_t specializations that was part of the motivation for their
> deprecation in the first place (see the discussion of codecvt in the
> Motivation section of P0482R6 <https://wg21.link/p0482r6#motivation>).
>
> However, an endeavor to improve the situation for iostreams and
> charN_t next runs into SG16 #33
> <https://github.com/sg16-unicode/sg16/issues/33>; std::basic_fstream
> does not support the UTF-8 and UTF-16 encodings for the "internal"
> side of a std::codecvt conversion because std::basic_filebuf requires
> that, per [locale.codecvt.virtuals]p4
> <http://eel.is/c++draft/locale.codecvt#virtuals-4> and its related
> footnote <http://eel.is/c++draft/locale.codecvt#footnote-246>,
> "internal" characters are mapped 1-N to "external" characters. This is
> an existing issue for std::basic_fstream<wchar_t> with UTF-16 data.
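
The 1-N constraint can be made concrete with a small sketch. It assumes valid Unicode scalar values as input, and the helper names utf16_code_units, utf8_code_units, and fits_one_to_n_mapping are hypothetical. A character outside the Basic Multilingual Plane needs two UTF-16 code units, so no single "internal" char16_t element corresponds to a complete "external" sequence.

```cpp
#include <cstddef>

// Number of UTF-16 code units needed for a Unicode scalar value.
// Assumes cp is a valid scalar value (not a surrogate, <= 0x10FFFF).
constexpr std::size_t utf16_code_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Number of UTF-8 code units (bytes) needed for a Unicode scalar value.
constexpr std::size_t utf8_code_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// A 1-N mapping requires each internal element to stand for a whole
// character; that holds for UTF-16 only within the BMP.
constexpr bool fits_one_to_n_mapping(char32_t cp) {
    return utf16_code_units(cp) == 1;
}
```

For U+1F600, two internal UTF-16 code units correspond to four external UTF-8 bytes, which is an M-N mapping rather than the 1-N mapping that std::basic_filebuf expects.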
>
> The Microsoft and libstdc++ standard library implementations appear to
> support iostreams with charN_t types, at least on the surface. Libc++
> intentionally does not provide definitions for charN_t specializations
> of locale facets that are not required by the standard, so even basic
> usage provokes compilation errors. I have not yet
> investigated to what extent the Microsoft and libstdc++
> implementations work as might be expected. My impression is that,
> where they do produce expected results, it is serendipity at work. See
> https://godbolt.org/z/6T7hebY33 for a bit of fun (testing on Windows
> requires changes to use an actual zero valued file since Windows
> doesn't provide a builtin analog for /dev/zero, but in that case, MSVC
> produces an executable that behaves as might be expected).
>
> I haven't looked hard, but I have not yet identified any code in the
> wild that uses iostreams with charN_t types. One would think that, if
> any project did, it would be ICU. I confirmed that ICU, despite its
> use of char16_t, makes no attempt to use it with iostreams.
>
> So where is this all going? I see three general options that can be
> pursued to resolve these various issues.
>
> 1. We can fix these issues, despite the acknowledged ABI impact, so
> that the standard no longer actively hinders support for iostreams
> with the charN_t types. Optionally, we could further explore
> requiring such support in the standard (doing so would require
> adding charN_t support to more locale facets).
>
With respect to support for iostreams with charN_t types requiring added
support for more locale facets, please note that extending support for
std::format() to charN_t types would presumably also require adding
support to most, if not all, of the same locale facets.

Tom.

> 2. We can declare that iostreams will never support the charN_t types
> in the standard and deprecate and remove the fragments of such
> support that are present. Implementations could of course provide
> support as an extension if they so desire.
> 3. We can admit things are broken, choose to do nothing about it, and
> close the related LWG issues while chanting sorry-not-sorry.
>
> The above issues are sufficiently complicated that I believe a paper
> is warranted regardless of the direction that we favor. I'm signing up
> to write that paper since I'm responsible for some of the mess. I do
> not intend to poll any directions in this meeting; rather, the focus
> is to ensure that the issues are well understood, to discuss decisions
> we could make and their potential consequences, and to generally
> collect information that will lead to a better paper.
>
> Responses provided before the meeting to identify other existing
> related issues or considerations would be appreciated. Ideal responses
> do not include the phrase "burn it all to the ground".
>
> Tom.
>
Received on 2023-10-25 14:56:54