Re: LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit

From: Peter Brett <pbrett_at_[hidden]>
Date: Tue, 30 May 2023 17:18:53 +0000
Hi Tom,

I saw that Jon bumped this on the issue tracker. I’m loath to spend much time on this; char_traits is just completely broken.

Unfortunately, the brokenness propagates from char_traits to other standard library facilities, so we probably need to talk about it.

Picking an example at random: if we take the deprecate-and-replace strategy, we immediately have to tackle basic_istream::peek() and basic_istream::get(), which are defined to return a traits::int_type value. Would we end up in our classic “broken defaults” situation, where users should always call basic_istream::peek_eof() but the broken function has the pretty name?

I would like the ABI Review Group to look at this to get some specific feedback on the impact of increasing the size of std::char_traits<char16_t>::int_type to std::uint_least32_t. This seems by far to be the simplest and most preferable option, if it’s available.

Best regards,


From: SG16 <sg16-bounces_at_lists.isocpp.org> On Behalf Of Tom Honermann via SG16
Sent: 30 May 2023 18:02

SG16 has had an issue<https://github.com/sg16-unicode/sg16/issues/32> tracking a design defect with the std::char_traits<char16_t> specialization since 2018. The issue was originally reported by Jonathan Wakely as LWG 2959<https://wg21.link/lwg2959>. Jonathan recently created an LWG GitHub tracking issue<https://github.com/cplusplus/papers/issues/1572> and assigned it to SG16. I'll schedule this for discussion at a future SG16 telecon, but would like to discuss some options on the mailing list first. I encourage reading the comments in the SG16 issue<https://github.com/sg16-unicode/sg16/issues/32> before proceeding.

Briefly, the design defect is that std::char_traits<char16_t>::int_type is specified ([char.traits.specializations.char16.t]<http://eel.is/c++draft/char.traits.specializations.char16.t>) to be std::uint_least16_t. The problem is that all 16-bit values are valid code units in UTF-16, so there is no value left to indicate the EOF condition that int_type is intended to be used for (unless uint_least16_t happens to be wider than 16 bits, which is not the case for major implementations).
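
As a rough illustration of the counting argument above, the following sketch checks whether a traits class's int_type can represent any value beyond the 16-bit code unit range. The helper has_value_outside_utf16 and the type widened_traits_sketch are hypothetical names invented for this example; they are not part of the standard library or of any proposal.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper, not a standard library facility: true when
// Traits::int_type can represent at least one value above 0xFFFF, i.e. a
// value that can never collide with a 16-bit UTF-16 code unit.
template <class Traits>
constexpr bool has_value_outside_utf16() noexcept {
    return std::numeric_limits<typename Traits::int_type>::max() > 0xFFFF;
}

// Hypothetical stand-in for a traits class whose int_type has been widened.
struct widened_traits_sketch {
    using int_type = std::uint_least32_t;
};

// Guaranteed: uint_least32_t has at least 32 bits.
static_assert(has_value_outside_utf16<widened_traits_sketch>(),
              "a 32-bit int_type leaves room for a distinct EOF value");

// Implementation-dependent: false wherever uint_least16_t is exactly 16 bits.
constexpr bool char16_traits_have_room =
    has_value_outside_utf16<std::char_traits<char16_t>>();
```

Where char16_traits_have_room is false, whatever value eof() returns necessarily coincides with some valid UTF-16 code unit.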

There does not appear to be a way to fix this problem without causing an ABI break; we can't just change the int_type type alias to use a larger type. However, we could bring the issue to the ABI Review Group (ARG) to see if they know of some black magic that could be useful.

This same problem occurs for std::char_traits<wchar_t> when both wchar_t and wint_t (the specified target of the int_type member type alias; [char.traits.specializations.wchar.t]<http://eel.is/c++draft/char.traits.specializations.wchar.t>) are 16-bit types and the wide character encoding is UTF-16. This is the case for Microsoft's implementation.

Absent a way to fix the problem directly, it seems a deprecate-and-replace strategy will be required. I haven't thoroughly researched this, but a possible approach is to deprecate the existing int_type member (for all of the std::char_traits specializations) and to introduce a new eof_type member that is guaranteed to be able to hold a value that is not a valid code unit value. This would require replacements for at least the following member functions:

  * static constexpr int_type not_eof(int_type c) noexcept;
  * static constexpr char_type to_char_type(int_type c) noexcept;
  * static constexpr int_type to_int_type(char_type c) noexcept;
  * static constexpr bool eq_int_type(int_type c1, int_type c2) noexcept;
  * static constexpr int_type eof() noexcept;
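
To make the shape of such replacements concrete, here is a minimal sketch of what eof_type-based counterparts to the five functions above might look like for char16_t. Everything in it is an assumption made for illustration: the struct name, the member names to_eof_type and eq_eof_type, the choice of std::uint_least32_t, and the EOF value 0xFFFFFFFF are not taken from any proposal or implementation.

```cpp
#include <cstdint>

// Hypothetical sketch only; names and values are illustrative assumptions.
struct char16_eof_traits_sketch {
    using char_type = char16_t;
    // At least 32 bits wide, so values above 0xFFFF are always representable.
    using eof_type = std::uint_least32_t;

    // Assumed EOF value; any value above 0xFFFF would serve equally well.
    static constexpr eof_type eof() noexcept { return 0xFFFFFFFFu; }

    static constexpr eof_type to_eof_type(char_type c) noexcept {
        return static_cast<eof_type>(c);
    }

    // Only meaningful for values that are not EOF.
    static constexpr char_type to_char_type(eof_type v) noexcept {
        return static_cast<char_type>(v);
    }

    static constexpr bool eq_eof_type(eof_type a, eof_type b) noexcept {
        return a == b;
    }

    // Returns v unchanged unless it is EOF, in which case some non-EOF value.
    static constexpr eof_type not_eof(eof_type v) noexcept {
        return eq_eof_type(v, eof()) ? eof_type{0} : v;
    }
};

// The code unit 0xFFFF is no longer mistaken for EOF and round-trips intact.
static_assert(!char16_eof_traits_sketch::eq_eof_type(
                  char16_eof_traits_sketch::to_eof_type(char16_t(0xFFFF)),
                  char16_eof_traits_sketch::eof()),
              "0xFFFF must be distinguishable from EOF");
```

The sketch deliberately uses new member names rather than overloading the existing ones, since int_type and eof_type are both integer types and overloads on them could lead to surprising conversions at call sites.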

What would be the consequences of such a change? Are other parts of the standard library impacted?

I know that nobody is excited about the prospect of spending time addressing issues with char_traits.

Please share your thoughts.


Received on 2023-05-30 17:19:04