Date: Wed, 25 Oct 2023 09:24:37 -0700
Option 4 looks like the worst option imaginable because it opens even more
possibilities for divergence between vendors that we'll have to deal with
later.
- Victor
On Tue, Oct 24, 2023 at 7:55 AM Alisdair Meredith via SG16 <
sg16_at_[hidden]> wrote:
> I’m going to throw in:
>
> Option 4:
>
> Remove the Unicode-type specialisations for parts that are needed only for
> iostreams support, so that vendors are free to implement them as they see
> fit as part of their own extensions for Unicode support in iostreams.
>
> If vendors report good success implementing full support for Unicode types
> in <iostreams>, then we should consider a paper to extend that support into
> the standard in the future.
>
> Option 5: (the pessimist)
>
> Deprecate iostreams! I have heard this mooted several times over the years,
> but lacking a replacement facility I don’t think this suggestion can go
> anywhere.
>
> AlisdairM
>
> > On Oct 24, 2023, at 1:11 AM, Tom Honermann <tom_at_[hidden]> wrote:
> >
> > SG16 will hold a telecon on Wednesday, October 25th, at 19:30 UTC
> > (timezone conversion).
> > The agenda follows.
> > • charN_t, char_traits, codecvt, and iostreams:
> >   • P2873R0: Remove Deprecated Locale Category Facets For Unicode from C++26
> >   • LWG 3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
> >   • LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit
> >   • SG16 #32: std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits
> >   • SG16 #33: A correct codecvt facet that works with basic_filebuf can't do UTF conversions
> > Hang on, this is going to be a bumpy ride.
> > When char16_t and char32_t were added for C++11, the standard library
> was extended to support corresponding specializations of std::char_traits
> ([char.traits.general]p1) and std::basic_string
> ([string.classes.general]p1). Curiously, type aliases were added for
> specializations of the std::fpos ([iosfwd.syn]) class template (but only in
> the synopsis) and support for these types was added for the std::codecvt
> ([tab:locale.category.facets]) and std::codecvt_byname ([tab:locale.spec])
> locale facets, but not for any of the other locale facets nor for iostreams
> in general. Support for these types was added to std::basic_string_view
> ([string.view.synop]) and std::filesystem::path ([fs.path.type.cvt]p2) in
> C++17, but no additional support was ever extended to iostreams. The status
> quo is thus that the standard requires implementations to provide some
> fragments (std::fpos, std::codecvt, and std::codecvt_byname) of iostream
> support for these types despite there being no use of these type aliases
> and specializations in the standard; implementations are not required to
> support streams of char16_t or char32_t.
> > std::char_traits is used by both the string library (e.g.,
> std::basic_string) and iostreams. However, the string library only depends
> on some of the std::char_traits members; it does not make use of the
> int_type member type alias nor any of the member functions that depend on
> that type (eof(), not_eof(), to_char_type(), to_int_type(),
> eq_int_type()). Per LWG 2959 and SG16 #32, the specified
> std::char_traits<char16_t> specialization has a defect; all char16_t values
> are valid code unit values, but the int_type member type alias is defined
> as uint_least16_t (the same underlying type as char16_t) and it is thus
> unable to hold a distinct value for EOF. The obvious fix is to use a larger
> type for int_type, but that would result in an ABI break. I recently asked
> the ABI review group if there are any known tricks they could deploy to
> mitigate an ABI break, but no direct solutions were identified; a
> suggestion to provide an alternative type for std::char_traits<char16_t>
> that programmers would have to explicitly use instead of the broken
> specialization was offered. That is an option, but since the problematic
> int_type member is not actually used by any functionality the standard
> requires implementors to provide, an ABI break in this case might have
> little practical consequence.
> > When char8_t was added for C++20 via P0482R6 (char8_t: A type for UTF-8
> characters and strings), I failed to understand the intended purpose for
> which std::codecvt was added to the standard. My impression of it at the
> time was that it was a poorly designed general transcoding facility; I
> failed to appreciate its significance as a locale facet as used by
> iostreams. This resulted in two mistakes:
> > • I deprecated the following specializations (and their use as
> locale category facets):
> > std::codecvt<char16_t, char, std::mbstate_t>
> > std::codecvt<char32_t, char, std::mbstate_t>
> > std::codecvt_byname<char16_t, char, std::mbstate_t>
> > std::codecvt_byname<char32_t, char, std::mbstate_t>
> > • I added the following specializations as required locale category
> facets (adding the specializations themselves is arguably not a mistake,
> but adding them as locale category facets is):
> > std::codecvt<char16_t, char8_t, std::mbstate_t>
> > std::codecvt<char32_t, char8_t, std::mbstate_t>
> > std::codecvt_byname<char16_t, char8_t, std::mbstate_t>
> > std::codecvt_byname<char32_t, char8_t, std::mbstate_t>
> > Note that std::codecvt facets are only used by std::basic_filebuf which
> only ever converts to and from elements of type char; the facets that
> convert to and from char8_t are not substitutable for that purpose.
> >
> > P2873R0, which SG16 already approved (or, rather, did not object to)
> during the 2023-05-26 SG16 meeting, now seeks to remove the deprecated
> specializations. LWG 3767 tracks addressing the incorrect addition of the
> char8_t specializations as locale facets.
> > Arguably, P0482R6 should have added the following specializations as
> locale facets:
> > • std::codecvt<char8_t, char, std::mbstate_t>
> > • std::codecvt_byname<char8_t, char, std::mbstate_t>
> > The only specification for std::codecvt_byname in the standard is the
> synopsis in [locale.codecvt.byname]; there is no other wording present.
> > As mentioned, the standard does not require implementations to provide
> iostream support for the charN_t types. However, implementations may do so
> as an extension. If they do, then, per [filebuf.general]p7, specializations
> of std::codecvt<charN_t, char, std::mbstate_t> are required to be available
> via a call to std::use_facet() for the imbued locale. In which case, per
> the standard, the status of the necessary specializations is:
> > • std::codecvt<char8_t, char, std::mbstate_t> # Not specified.
> > • std::codecvt<char16_t, char, std::mbstate_t> # Deprecated.
> > • std::codecvt<char32_t, char, std::mbstate_t> # Deprecated.
> > If it is desirable to provide a better foundation for iostream support
> of the charN_t types, either for a future version of the standard, or for
> implementations that want to provide such support as an extension, we could
> undeprecate the previously deprecated specializations and add the missing
> one for char8_t. Since iostreams does not support charN_t in the standard
> today and since the char16_t and char32_t specializations have already been
> deprecated for two release cycles, perhaps it is even reasonable to change
> their behavior so that they convert to and from the locale encoding rather
> than UTF-8. This would remove the existing inconsistency with the
> corresponding char and wchar_t specializations that was part of the
> motivation for their deprecation in the first place (see the discussion of
> codecvt in the Motivation section of P0482R6).
> > However, an endeavor to improve the situation for iostreams and charN_t
> next runs into SG16 #33; std::basic_fstream does not support the UTF-8 and
> UTF-16 encodings for the "internal" side of a std::codecvt conversion
> because std::basic_filebuf requires that, per [locale.codecvt.virtuals]p4
> and its related footnote, "internal" characters are mapped 1-N to
> "external" characters. This is an existing issue for
> std::basic_fstream<wchar_t> with UTF-16 data.
> > The Microsoft and libstdc++ standard library implementations appear to
> support iostreams with charN_t types; at least on the surface. Libc++
> intentionally does not provide definitions for charN_t specializations of
> locale facets that are not required by the standard and this suffices for
> basic usage to provoke compilation errors. I have not yet investigated to
> what extent the Microsoft and libstdc++ implementations work as might be
> expected. My impression is that, where they do produce expected results, it
> is serendipity at work. See https://godbolt.org/z/6T7hebY33 for a bit of
> fun (testing on Windows requires changes to use an actual zero valued file
> since Windows doesn't provide a builtin analog for /dev/zero, but in that
> case, MSVC produces an executable that behaves as might be expected).
> > I haven't looked hard, but I have not yet identified any code in the
> wild that uses iostreams with charN_t types. One would think that, if any
> project did, it would be ICU. I confirmed that ICU, despite its use of
> char16_t, makes no attempt to use it with iostreams.
> > So where is this all going? I see three general options that can be
> pursued to resolve these various issues.
> > • We can fix these issues, despite the acknowledged ABI impact, so
> that the standard no longer actively hinders support for iostreams with the
> charN_t types. Optionally, we could further explore requiring such support
> in the standard (doing so would require adding charN_t support to more
> locale facets).
> > • We can declare that iostreams will never support the charN_t types
> in the standard and deprecate and remove the fragments of such support that
> are present. Implementations could of course provide support as an
> extension if they so desire.
> > • We can admit things are broken, choose to do nothing about it, and
> close the related LWG issues while chanting sorry-not-sorry.
> > The above issues are sufficiently complicated that I believe a paper is
> warranted regardless of the direction that we favor. I'm signing up to
> write that paper since I'm responsible for some of the mess. I do not
> intend to poll any directions in this meeting; rather, the focus is to
> ensure that the issues are well understood, to discuss decisions we could
> make and their potential consequences, and to generally collect information
> that will lead to a better paper.
> > Responses provided before the meeting to identify other existing related
> issues or considerations would be appreciated. Ideal responses do not
> include the phrase "burn it all to the ground".
> > Tom.
> >
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
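For readers following the thread, the std::char_traits<char16_t> defect described in the quoted message (LWG 2959, SG16 #32) can be sketched as a compile-time check. This is only an illustration, not part of any proposal: the helper name int_type_is_wider is hypothetical, and the size comparison is a heuristic that assumes a mainstream platform where int is wider than char and uint_least16_t is exactly as wide as char16_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the standard or the thread): reports whether
// char_traits<CharT>::int_type is wider than CharT. Only a wider int_type has
// spare values left over to represent an out-of-band EOF.
template <class CharT>
constexpr bool int_type_is_wider() {
    return sizeof(typename std::char_traits<CharT>::int_type) > sizeof(CharT);
}

// char: int_type is int, so EOF can be distinct from every char value
// (assumes sizeof(int) > 1, which holds on mainstream platforms).
static_assert(int_type_is_wider<char>(),
              "expected int_type to be wider than char");

// char16_t: int_type is specified as uint_least16_t, the underlying type of
// char16_t, so every int_type value is also a possible code unit value.
static_assert(!int_type_is_wider<char16_t>(),
              "expected int_type to be no wider than char16_t");
```

Because both types have the same width, whatever value eof() returns for char16_t necessarily coincides with some code unit value, which is why the quoted message says the only direct fix is a larger int_type and therefore an ABI break.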
Received on 2023-10-25 16:24:50