Re: SG16 Digest, Vol 48, Issue 7

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 24 Oct 2023 10:53:23 -0400

Hi, Zoran. Thank you for sharing your thoughts.

If you plan to contribute to the mailing list discussions, please
disable digest mode [1] so that you can respond to individual messages.
Otherwise, your responses will not thread properly with others and will
be more difficult to follow.

There have been various proposals to provide more comprehensive encoding
and conversion services along the lines of what you outlined. None of
these has yet progressed to a complete proposal that is suitable for
adoption. Papers you might want to get familiar with include:

  * P0244: Text_view: A C++ concepts and range based character encoding
    and code point enumeration library <https://wg21.link/p0244>
  * P1629: Standard Text Encoding <https://wg21.link/p1629>
  * WG14 N3095: Restartable and Non-Restartable Functions for Efficient
    Character Conversions
    <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm>

The last one targets C and is intended to provide low level support that
a C++ encoding library like the ones discussed in P0244
<https://wg21.link/p0244> or P1629 <https://wg21.link/p1629> could use
internally. This proposal narrowly missed C23 but is poised to be
accepted for the next C revision.

Recently, we have been looking at the following paper which only seeks
to provide conversions between UTF encodings. It is designed to work
seamlessly with range and iterator libraries; it therefore looks quite
different from the general encoding libraries like ICU, iconv(),
Microsoft's APIs, or the libraries of other languages like Python.

  * P2728: Unicode in the Library, Part 1: UTF Transcoding
    <https://wg21.link/p2728>

Tom.

[1]: To disable digest mode, go to
https://lists.isocpp.org/mailman/listinfo.cgi/sg16 and click
"Unsubscribe or edit options". Enter your email address and the password
you used when you subscribed to the list and click "Log in". Scroll down
to find "Set Digest Mode", change the value to "Off", and then click
"Submit My Changes".

On 10/24/23 6:58 AM, Zoran Sibalic via SG16 wrote:
> Hi Tom,
>
> Well, this will not be a 'burn it to the ground' option (maybe), but
> to be honest, I am just wondering whether there should be a much more
> elegant solution for Unicode text support in C++ than the current
> approach, which seems to be trying to fix a huge mess created over a
> long period of time. It is as if the STL container classes had not been
> created with templated allocators, so all code managed allocation
> manually, and now that new allocators were needed, people would have
> to change the whole existing code base to make the new allocators
> work ....
>
> So my suggestion, with a simple example here, is to create a
> templated character type (let's call it an encoded character, or echar
> for short) that carries its encoding as a template parameter.
>
> template < class ENCODING >
> concept CHAR_ENCODING = requires( ENCODING::type * _charType,
> char32_t * _charUtf32, size_t size )
> {
> typename ENCODING::type;
> { ENCODING::Encode( _charType , size, _charUtf32, size ) } ->
> std::same_as<bool>;
> { ENCODING::Decode( _charUtf32, size, _charType , size ) } ->
> std::same_as<bool>;
> };
> //-/////////////////////////////////////////////////////////////////////////////////////////////////////-//
> #pragma pack(push, 1)
> //-/////////////////////////////////////////////////////////////////////////////////////////////////////-//
> template<CHAR_ENCODING Encoding>
> struct echar
> {
> using ENCODING = Encoding;
> using TYPE = Encoding::type;
> TYPE encoding_char;
> };
> //-/////////////////////////////////////////////////////////////////////////////////////////////////////-//
> #pragma pack(pop)
> //-/////////////////////////////////////////////////////////////////////////////////////////////////////-//
>
> Then creating an encoding and a new encoded character type is simple,
> and can easily be done by an average C++ user:
>
> class ASCII_ENCODING
> {
> public:
> using type = char;
> // Required by concept for all Encodings
> static constexpr bool Encode( char* pData, const size_t sizeData,
> const char32_t* pSource, const size_t sizeSource )
> { return false; };
> static constexpr bool Decode( char32_t* pData, const size_t sizeData,
> const char* pSource, const size_t sizeSource )
> { return false; };
> }; static_assert( CHAR_ENCODING<ASCII_ENCODING> == true );
> class UTF08_ENCODING
> {
> public:
> using type = char8_t;
> // Required by concept for all Encodings
> static constexpr bool Encode( char8_t* pData, const size_t sizeData,
> const char32_t* pSource, const size_t sizeSource )
> { return false; };
> static constexpr bool Decode( char32_t* pData, const size_t sizeData,
> const char8_t* pSource, const size_t sizeSource )
> { return false; };
> }; static_assert( CHAR_ENCODING<UTF08_ENCODING> == true );
>
> This is just a basic encoding concept; it would certainly require many
> more static constexpr functions (for example, to compute the converted
> size, validate text, or compare texts case-sensitively or
> case-insensitively). Each of these functions can be required or
> optional; we can check for their availability at compile time and even
> test some of them at compile time, since they would all be required to
> be constexpr ....
>
> Then it's simple to create new encoded char types:
> using ascii_char = echar<ASCII_ENCODING>;
> static_assert( sizeof(ascii_char) == sizeof(char ));
> using utf08_char = echar<UTF08_ENCODING>;
> static_assert( sizeof(utf08_char) == sizeof(char8_t));
>
> Usage at the moment is not perfect:
> const char* text1 = "ASCII text";
> const char8_t* text2 = u8"UTF08 text";
> const ascii_char* ascii_text = reinterpret_cast<const ascii_char*>(text1);
> const utf08_char* utf08_text = reinterpret_cast<const utf08_char*>(text2);
>
> But it does work. The idea is basically that if all other Standard
> Library classes (strings, views, format, print ...) work with the
> echar type, everything will work, even with encodings that users add
> in their own code...
>
> So even a utf16_roman_char string will auto-convert to a
> utf16_korean_char string if requested (first a conversion to UTF-32,
> then a second conversion back to UTF-16 with the Korean encoding). Of
> course, a double conversion is not the best for performance, but we can
> always add constexpr functions that optimize this and use them when a
> compile-time check finds them, for example:
> // Optional for faster encoding //
> inline constexpr bool FastEncode( std::span<utf08_char> Dest, const
> std::span<ascii_char> Source ) { return false; };
> inline constexpr bool FastEncode( std::span<ascii_char> Dest, const
> std::span<utf08_char> Source ) { return false; };
>
> Full example on Compiler Explorer: https://godbolt.org/z/azEzKceeq so
> people can grasp the idea of this approach to the whole Unicode
> problem. The approach is to move the Unicode rules out of the Standard
> Library code into a single templated class per encoding, so that
> library classes can work correctly because they know the encoding of
> each character type. Adding new encoding types would be simple for a
> user, and because constexpr is enforced, many issues could be checked
> at compile time and reported to the user when a new encoding type is
> created...
>
> How simple the solution would be if it were possible to make the
> built-in character types templated, like:
>
> char = echar<ASCII_ENCODING>; ( maybe better char =
> echar<CHAR_ENCODING>; for back support as people used char as utf08 char )
> char8_t = echar<UTF08_ENCODING>;
> char16_t = echar<UTF16_DEFAULT_ENCODING>;
> char32_t = echar<UTF32_ENCODING>;
> wchar_t = echar<UTF16_DEFAULT_ENCODING> for Windows
> wchar_t = echar< UTF32_ENCODING > for Mac
>
> and then make all classes work with echar<> type instead of supporting
> all these 5 types ....
>
> Best regards to all :)
> Zoran Sibalic.
>
> On Tue, Oct 24, 2023 at 8:40 AM <sg16-request_at_[hidden]> wrote:
>
> Today's Topics:
>
> 1. Agenda for the 2023-10-25 SG16 telecon (Tom Honermann)
> 2. Re: Agenda for the 2023-10-25 SG16 telecon (Jens Maurer)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 24 Oct 2023 01:11:20 -0400
> From: Tom Honermann <tom_at_[hidden]>
> To: SG16 <sg16_at_[hidden]>, Alisdair Meredith
> <alisdairm_at_[hidden]>, Jonathan Wakely <cxx_at_[hidden]>,
> Charles Barto
> <chbarto_at_[hidden]>, Mark de Wever <koraq_at_[hidden]>
> Subject: [SG16] Agenda for the 2023-10-25 SG16 telecon
> Message-ID: <22f6f509-63c5-4728-96c3-241562ffa940_at_[hidden]>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>
> SG16 will hold a telecon on Wednesday, October 25th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20231025T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).
>
> The agenda follows.
>
> * charN_t, char_traits, codecvt, and iostreams:
>     o P2873R0: Remove Deprecated Locale Category Facets For Unicode
>       from C++26 <https://wg21.link/p2873r0>
>     o LWG 3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added
>       to locale <https://wg21.link/lwg3767>
>     o LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit
>       <https://wg21.link/lwg2959>
>         + SG16 #32: std::char_traits<char16_t>::eof() requires
>           uint_least16_t to be larger than 16 bits
>           <https://github.com/sg16-unicode/sg16/issues/32>
>     o SG16 #33: A correct codecvt facet that works with basic_filebuf
>       can't do UTF conversions
>       <https://github.com/sg16-unicode/sg16/issues/33>
>
> Hang on, this is going to be a bumpy ride.
>
> When char16_t and char32_t were added for C++11, the standard library
> was extended to support corresponding specializations of
> std::char_traits ([char.traits.general]p1
> <http://eel.is/c++draft/char.traits.general#1>) and std::basic_string
> ([string.classes.general]p1
> <http://eel.is/c++draft/string.classes#general-1>). Curiously, type
> aliases were added for specializations of the std::fpos ([iosfwd.syn]
> <http://eel.is/c++draft/iosfwd.syn#lib:fpos>) class template (but
> only
> in the synopsis) and support for these types was added for the
> std::codecvt ([tab:locale.category.facets]
> <http://eel.is/c++draft/locale.category#tab:locale.category.facets>)
> and
> std::codecvt_byname ([tab:locale.spec]
> <http://eel.is/c++draft/locale.category#tab:locale.spec>) locale
> facets,
> but not for any of the other locale facets nor for iostreams in
> general.
> Support for these types was added to std::basic_string_view
> ([string.view.synop] <http://eel.is/c++draft/string.view.synop>) and
> std::filesystem::path ([fs.path.type.cvt]p2
> <http://eel.is/c++draft/fs.path.type.cvt#2>) in C++17, but no
> additional
> support was ever extended to iostreams. The status quo is thus
> that the
> standard requires implementations to provide some fragments
> (std::fpos,
> std::codecvt, and std::codecvt_byname) of iostream support for these
> types despite there being no use of these type aliases and
> specializations in the standard; implementations are not required to
> support streams of char16_t or char32_t.
>
> std::char_traits is used by both the string library (e.g.,
> std::basic_string) and iostreams. However, the string library only
> depends on some of the std::char_traits members; it does not make
> use of
> the int_type member type alias nor any of the member functions that
> depend on that type (eof(), not_eof(), to_char_type(), to_int_type(),
> eq_int_type()). Per LWG 2959 <https://wg21.link/lwg2959> and SG16 #32
> <https://github.com/sg16-unicode/sg16/issues/32>, the specified
> std::char_traits<char16_t> specialization has a defect; all char16_t
> values are valid code unit values, but the int_type member type
> alias is
> defined as uint_least16_t (the same underlying type as char16_t)
> and it
> is thus unable to hold a distinct value for EOF. The obvious fix
> is to
> use a larger type for int_type, but that would result in an ABI
> break. I
> recently asked the ABI review group if there are any known tricks
> they
> could deploy to mitigate an ABI break, but no direct solutions were
> identified; a suggestion to provide an alternative type for
> std::char_traits<char16_t> that programmers would have to
> explicitly use
> instead of the broken specialization was offered. That is an
> option, but
> since the problematic int_type member is not actually used by any
> functionality the standard requires implementors to provide, an ABI
> break in this case might have little practical consequence.
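
The int_type problem in the quoted paragraph can be made concrete with a minimal C++17 sketch. The helper name int_type_has_spare_values is hypothetical, and comparing object sizes is only a rough proxy for whether a distinct EOF value can exist; the check on char assumes a platform where int is larger than char.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// True when int_type is strictly larger than char_type, so that it has room
// for an EOF value distinct from every character value.
// (Hypothetical helper; a size comparison is only a rough proxy.)
template <class CharT>
constexpr bool int_type_has_spare_values =
    sizeof(typename std::char_traits<CharT>::int_type) > sizeof(CharT);

// The standard specifies int_type for char16_t as uint_least16_t, which is
// also the underlying type of char16_t.
static_assert(std::is_same_v<std::char_traits<char16_t>::int_type,
                             std::uint_least16_t>);

// char: int_type is int, which is larger than char on common platforms,
// so EOF can be a value that no char converts to.
static_assert(int_type_has_spare_values<char>);

// char16_t: int_type is no larger than char16_t itself. Every value it can
// hold is also a valid UTF-16 code unit, so no value is left over for EOF.
static_assert(!int_type_has_spare_values<char16_t>);
```

By the same size measure char32_t has no spare bits either, but values above 0x10FFFF are not code points, so a distinct EOF value is still available there.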
>
> When char8_t was added for C++20 via P0482R6 (char8_t: A type for
> UTF-8
> characters and strings) <https://wg21.link/p0482>, I failed to
> understand the intended purpose for which std::codecvt was added
> to the
> standard. My impression of it at the time was that it was a poorly
> designed general transcoding facility; I failed to appreciate its
> significance as a locale facet as used by iostreams. This resulted in
> two mistakes:
>
> 1. I deprecated the following specializations (and their use as
> locale
> category facets):
> std::codecvt<char16_t, char, std::mbstate_t>
> std::codecvt<char32_t, char, std::mbstate_t>
> std::codecvt_byname<char16_t, char, std::mbstate_t>
> std::codecvt_byname<char32_t, char, std::mbstate_t>
> 2. I added the following specializations as required locale category
> facets (adding the specializations themselves is arguably not a
> mistake, but adding them as locale category facets is):
> std::codecvt<char16_t, char8_t, std::mbstate_t>
> std::codecvt<char32_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char16_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char32_t, char8_t, std::mbstate_t>
>
> Note that std::codecvt facets are only used by std::basic_filebuf
> which
> only ever converts to and from elements of type char; the facets that
> convert to and from char8_t are not substitutable for that purpose.
>
> P2873R0 <https://wg21.link/p2873r0>, which SG16 already approved (or,
> rather, did not object to) during the 2023-05-26 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#may-24th-2023>, now
> seeks
> to remove the deprecated specializations. LWG 3767
> <https://wg21.link/lwg3767> tracks addressing the incorrect
> addition of
> the char8_t specializations as locale facets.
>
> Arguably, P0482R6 <https://wg21.link/p0482> should have added the
> following specializations as locale facets:
>
> * std::codecvt<char8_t, char, std::mbstate_t>
> * std::codecvt_byname<char8_t, char, std::mbstate_t>
>
> The only specification for std::codecvt_byname in the standard is the
> synopsis in [locale.codecvt.byname]
> <http://eel.is/c++draft/locale.codecvt.byname>; there is no other
> wording present.
>
> As mentioned, the standard does not require implementations to
> provide
> iostream support for the charN_t types. However, implementations
> may do
> so as an extension. If they do, then, per [filebuf.general]p7
> <http://eel.is/c++draft/input.output#filebuf.general-7>,
> specializations
> of std::codecvt<charN_t, char, std::mbstate_t> are required to be
> available via a call to std::use_facet() for the imbued locale. In
> which case, per the standard, the status of the necessary
> specializations is:
>
> * std::codecvt<char8_t, char, std::mbstate_t> # Not specified.
> * std::codecvt<char16_t, char, std::mbstate_t> # Deprecated.
> * std::codecvt<char32_t, char, std::mbstate_t> # Deprecated.
>
> If it is desirable to provide a better foundation for iostream
> support
> of the charN_t types, either for a future version of the standard, or
> for implementations that want to provide such support as an
> extension,
> we could undeprecate the previously deprecated specializations and
> add
> the missing one for char8_t. Since iostreams does not support
> charN_t in
> the standard today and since the char16_t and char32_t
> specializations
> have already been deprecated for two release cycles, perhaps it is
> even
> reasonable to change their behavior so that they convert to and
> from the
> locale encoding rather than UTF-8. This would remove the existing
> inconsistency with the corresponding char and wchar_t specializations
> that was part of the motivation for their deprecation in the first
> place
> (see the discussion of codecvt in the Motivation section of P0482R6
> <https://wg21.link/p0482r6#motivation>).
>
> However, an endeavor to improve the situation for iostreams and
> charN_t
> next runs into SG16 #33
> <https://github.com/sg16-unicode/sg16/issues/33>; std::basic_fstream
> does not support the UTF-8 and UTF-16 encodings for the "internal"
> side
> of a std::codecvt conversion because std::basic_filebuf requires
> that,
> per [locale.codecvt.virtuals]p4
> <http://eel.is/c++draft/locale.codecvt#virtuals-4> and its related
> footnote <http://eel.is/c++draft/locale.codecvt#footnote-246>,
> "internal" characters are mapped 1-N to "external" characters.
> This is
> an existing issue for std::basic_fstream<wchar_t> with UTF-16 data.
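
The 1-N constraint in the quoted paragraph can be illustrated with a small sketch. Utf16Units and to_utf16 are hypothetical names, and the function assumes its argument is a valid Unicode scalar value (it does no validation).

```cpp
#include <cstddef>

// Result of encoding one code point as UTF-16: one or two code units.
struct Utf16Units {
    char16_t units[2];
    std::size_t count;
};

// Encodes a single code point as UTF-16.
// Assumes cp is a valid Unicode scalar value.
constexpr Utf16Units to_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        return {{static_cast<char16_t>(cp), u'\0'}, 1};
    }
    // Supplementary planes: a surrogate pair.
    cp -= 0x10000;
    return {{static_cast<char16_t>(0xD800 + (cp >> 10)),
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF))},
            2};
}

// U+0041 fits in one "internal" char16_t element.
static_assert(to_utf16(U'A').count == 1);

// U+1F600 needs two "internal" char16_t elements (0xD83D 0xDE00), while its
// UTF-8 form is four "external" bytes. That is a 2-to-4 mapping, which a
// model of one internal character per N external characters cannot express.
static_assert(to_utf16(U'\U0001F600').count == 2);
```

The same reasoning applies to UTF-8 as an internal encoding, where a single code point can occupy up to four internal elements.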
>
> The Microsoft and libstdc++ standard library implementations
> appear to
> support iostreams with charN_t types; at least on the surface. Libc++
> intentionally does not provide definitions for charN_t
> specializations
> of locale facets that are not required by the standard and this
> suffices
> for basic usage to provoke compilation errors. I have not yet
> investigated to what extent the Microsoft and libstdc++
> implementations
> work as might be expected. My impression is that, where they do
> produce
> expected results, it is serendipity at work. See
> https://godbolt.org/z/6T7hebY33 for a bit of fun (testing on Windows
> requires changes to use an actual zero valued file since Windows
> doesn't
> provide a builtin analog for /dev/zero, but in that case, MSVC
> produces
> an executable that behaves as might be expected).
>
> I haven't looked hard, but I have not yet identified any code in the
> wild that uses iostreams with charN_t types. One would think that, if
> any project did, it would be ICU. I confirmed that ICU, despite
> its use
> of char16_t, makes no attempt to use it with iostreams.
>
> So where is this all going? I see three general options that can be
> pursued to resolve these various issues.
>
> 1. We can fix these issues, despite the acknowledged ABI impact, so
> that the standard no longer actively hinders support for iostreams
> with the charN_t types. Optionally, we could further explore
> requiring such support in the standard (doing so would require
> adding charN_t support to more locale facets).
> 2. We can declare that iostreams will never support the charN_t types
> in the standard and deprecate and remove the fragments of such
> support that are present. Implementations could of course provide
> support as an extension if they so desire.
> 3. We can admit things are broken, choose to do nothing about it, and
> close the related LWG issues while chanting sorry-not-sorry.
>
> The above issues are sufficiently complicated that I believe a
> paper is
> warranted regardless of the direction that we favor. I'm signing
> up to
> write that paper since I'm responsible for some of the mess. I do not
> intend to poll any directions in this meeting; rather, the focus
> is to
> ensure that the issues are well understood, to discuss decisions we
> could make and their potential consequences, and to generally collect
> information that will lead to a better paper.
>
> Responses provided before the meeting to identify other existing
> related
> issues or considerations would be appreciated. Ideal responses do not
> include the phrase "burn it all to the ground".
>
> Tom.
>
> ------------------------------
>
> Message: 2
> Date: Tue, 24 Oct 2023 08:40:17 +0200
> From: Jens Maurer <jens.maurer_at_[hidden]>
> To: sg16_at_[hidden], Alisdair Meredith <alisdairm_at_[hidden]>,
> Jonathan Wakely <cxx_at_[hidden]>, Charles Barto
> <chbarto_at_[hidden]>, Mark de Wever <koraq_at_[hidden]>
> Cc: Tom Honermann <tom_at_[hidden]>
> Subject: Re: [SG16] Agenda for the 2023-10-25 SG16 telecon
> Message-ID: <c6b06c90-f3ed-4b00-96a6-3eedbf144f79_at_[hidden]>
> Content-Type: text/plain; charset=UTF-8
>
> Hi Tom,
>
> On 24/10/2023 07.11, Tom Honermann via SG16 wrote:
> > Hang on, this is going to be a bumpy ride.
>
> Thanks for the write-up.
>
> We should ask the implementers whether their basic_stream support
> for charN_t is intentional or accidental, and investigate a little
> more whether it works.
>
> Hyrum's law suggests that this will be used in the wild.
>
> My opinion:
>
> Let's make sure the ground is clear for a future extension
> to charN_t for basic_stream, but let's not try to address
> any of the deeper troubles (in particular the 1:N mapping for
> basic_fstream). In particular:
>
> Let's fix "int_type". The ABI of the standard library
> itself will not be broken, we just risk ABI breakage
> of user components, I think?
>
> Let's deprecate
>
> std::codecvt<char16_t, char8_t, std::mbstate_t>
> std::codecvt<char32_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char16_t, char8_t, std::mbstate_t>
> std::codecvt_byname<char32_t, char8_t, std::mbstate_t>
>
> Those might come back when a proper solution arrives.
>
>
> std::codecvt<char16_t, char, std::mbstate_t> # Deprecated.
> std::codecvt<char32_t, char, std::mbstate_t> # Deprecated.
>
> "Since iostreams does not support charN_t in the standard today
> and since the char16_t and char32_t specializations have already
> been deprecated for two release cycles, perhaps it is even
> reasonable to change their behavior so that they convert to and
> from the locale encoding rather than UTF-8."
>
> That might work for
>
> std::codecvt<char32_t, char, std::mbstate_t>
>
> but
>
> std::codecvt<char16_t, char, std::mbstate_t>
>
> runs afoul of the 1:N mapping issue, unless on a platform where
> everything
> fits into 16-bit Unicode, right?
>
> Best to leave those functions alone; I'm also ok with removing them.
>
> Jens
>

Received on 2023-10-24 14:53:31