LWG 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 30 May 2023 13:01:51 -0400
SG16 has had an issue <https://github.com/sg16-unicode/sg16/issues/32>
tracking a design defect with the std::char_traits<char16_t>
specialization since 2018. The issue was originally reported by Jonathan
Wakely as LWG 2959 <https://wg21.link/lwg2959>. Jonathan recently
created an LWG GitHub tracking issue
<https://github.com/cplusplus/papers/issues/1572> and assigned it to
SG16. I'll schedule this for discussion at a future SG16 telecon, but
would like to discuss some options on the mailing list first. I
encourage reading the comments in the SG16 issue
<https://github.com/sg16-unicode/sg16/issues/32> before proceeding.

Briefly, the design defect is that std::char_traits<char16_t>::int_type
is specified ([char.traits.specializations.char16.t]
<http://eel.is/c++draft/char.traits.specializations.char16.t>) to be
std::uint_least16_t. The problem is that all 16-bit values are valid
code units in UTF-16, so there is no value left to indicate the EOF
condition that int_type is intended to represent (unless
uint_least16_t happens to be wider than 16 bits, which is not the case
for major implementations).
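
The counting argument above can be checked at compile time. The
following is only a minimal sketch: the names traits16, has_spare_value
and eof_is_code_unit_value are made up for illustration, and the checks
compare raw numeric values rather than going through to_int_type(),
since an implementation may remap values there.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>

// Illustrative alias, not part of any proposal.
using traits16 = std::char_traits<char16_t>;

// The standard specifies this alias for the char16_t specialization.
static_assert(std::is_same_v<traits16::int_type, std::uint_least16_t>);

// True when int_type can represent at least one value above 0xFFFF,
// that is, a value that no UTF-16 code unit can take.
constexpr bool has_spare_value =
    std::numeric_limits<traits16::int_type>::max() > 0xFFFF;

// True when eof() has the same numeric value as some UTF-16 code unit.
constexpr bool eof_is_code_unit_value = traits16::eof() <= 0xFFFF;

// Pigeonhole: without a spare value, eof() must coincide with the
// numeric value of a valid code unit.
static_assert(has_spare_value || eof_is_code_unit_value);
```

Where uint_least16_t is exactly 16 bits wide, has_spare_value is false,
so the final assertion can only hold because eof() shares its value
with a code unit.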

There does not appear to be a way to fix this problem without causing an
ABI break; we can't just change the int_type type alias to use a larger
type. However, we could bring the issue to the ABI Review Group (ARG) to
see if they know of some black magic that could be useful.

This same problem occurs for std::char_traits<wchar_t> when both wchar_t
and wint_t (the specified target of the int_type member type alias;
[char.traits.specializations.wchar.t]
<http://eel.is/c++draft/char.traits.specializations.wchar.t>) are 16-bit
types and the wide character encoding is UTF-16. This is the case for
Microsoft's implementation.
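
A similar compile-time probe can be sketched for the wide character
case. This is an illustration under stated assumptions: the names
wtraits, utf16_sized_wide_types, weof_value and weof_in_code_unit_range
are hypothetical, the check additionally assumes an unsigned wint_t, and
it cannot detect the wide character encoding itself, only the type
widths that make the collision possible.

```cpp
#include <climits>
#include <cwchar>
#include <string>
#include <type_traits>

// Illustrative alias, not part of any proposal.
using wtraits = std::char_traits<wchar_t>;

// The standard specifies wint_t as int_type for the wchar_t specialization.
static_assert(std::is_same_v<wtraits::int_type, std::wint_t>);

// True when wchar_t and wint_t are both 16 bits wide and wint_t is
// unsigned, so every wint_t value is also a possible wchar_t value.
constexpr bool utf16_sized_wide_types =
    sizeof(wchar_t) * CHAR_BIT == 16 &&
    sizeof(std::wint_t) * CHAR_BIT == 16 &&
    std::is_unsigned_v<std::wint_t>;

// The numeric value of the wide EOF marker, widened for comparison.
constexpr long long weof_value = static_cast<long long>(wtraits::eof());

// True when the EOF marker lies in the range of 16-bit code unit values.
constexpr bool weof_in_code_unit_range =
    weof_value >= 0 && weof_value <= 0xFFFF;

// With 16-bit wide types there is no room for a distinct EOF value.
static_assert(!utf16_sized_wide_types || weof_in_code_unit_range);
```

On platforms with a 32-bit wchar_t, utf16_sized_wide_types is false and
the final assertion holds trivially.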

Absent a way to fix the problem directly, it seems a
deprecate-and-replace strategy will be required. I haven't thoroughly
researched this, but a possible approach is to deprecate the existing
int_type member (for all of the std::char_traits specializations) and to
introduce a new eof_type member that is guaranteed to be able to hold a
value that is not a valid code unit value. This would require
replacements for at least the following member functions:

  * static constexpr int_type not_eof(int_type c) noexcept;
  * static constexpr char_type to_char_type(int_type c) noexcept;
  * static constexpr int_type to_int_type(char_type c) noexcept;
  * static constexpr bool eq_int_type(int_type c1, int_type c2) noexcept;
  * static constexpr int_type eof() noexcept;
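
To make the shape of such replacements concrete, here is a rough sketch
of what an eof_type-based interface might look like for char16_t. Only
the name eof_type comes from the idea above; the struct name
char16_traits_sketch, the member names to_eof_type and eq_eof_type, the
choice of std::uint_least32_t, and the sentinel value are all
assumptions made for illustration, not proposed wording.

```cpp
#include <cstdint>

// Hypothetical traits class; nothing here is standardized.
struct char16_traits_sketch {
    using char_type = char16_t;
    // Assumption: a type of at least 32 bits leaves room for a sentinel
    // that no 16-bit code unit can equal.
    using eof_type = std::uint_least32_t;

    // Assumed sentinel value, chosen to lie outside the range 0..0xFFFF.
    static constexpr eof_type eof() noexcept { return 0xFFFFFFFFu; }

    static constexpr eof_type to_eof_type(char_type c) noexcept {
        return static_cast<eof_type>(c);
    }

    static constexpr char_type to_char_type(eof_type e) noexcept {
        return static_cast<char_type>(e);
    }

    static constexpr bool eq_eof_type(eof_type a, eof_type b) noexcept {
        return a == b;
    }

    // Returns e unchanged unless it is the sentinel, in which case some
    // value that does not compare equal to eof() is returned instead.
    static constexpr eof_type not_eof(eof_type e) noexcept {
        return eq_eof_type(e, eof()) ? eof_type{0} : e;
    }
};

// The code unit 0xFFFF is now distinguishable from end-of-file.
static_assert(!char16_traits_sketch::eq_eof_type(
    char16_traits_sketch::to_eof_type(static_cast<char16_t>(0xFFFF)),
    char16_traits_sketch::eof()));
```

With a wider eof_type, every code unit round-trips through to_eof_type()
and to_char_type() without ever comparing equal to eof(). The sketch
does not address how existing users of int_type, such as the stream
buffer classes, would migrate.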

What would be the consequences of such a change? Are other parts of the
standard library impacted?

I know that nobody is excited about the prospect of spending time
addressing issues with char_traits.

Please share your thoughts.

Tom.

Received on 2023-05-30 17:01:54