On Sun, 21 Jun 2026 at 18:39, Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:
Let's add mandates for CHAR_BIT == 8 everywhere UTF related. UTF is simply not defined in other scenarios.

So if you're on a platform without 8-bit bytes we should also disable support like std::to_chars with char8_t, std::format with char8_t format strings, etc.? That seems a bit too far. And if you're saying it's totally fine for char8_t to not be 8-bit (its underlying type is unsigned char by the way) but not for char to be non-8-bit if someone wants UTF support, I don't see any coherence to the design. It seems like we would basically need to yeet all Unicode support across the standard library out the window on such platforms.

I'm also not a fan of supporting weird char sizes, but if they are allowed in the language, we probably shouldn't make the Unicode support pay for it. I don't see much of a problem with having a few unused upper bits in a char8_t or char. Just because it has 64 bits doesn't mean you can't store UTF-8 code unit values inside.

Anyway, my greater issue with endian views is that they do the endianness conversion in the wrong place. You get a mathematically meaningless uint32_t value when performing a byteswap on, say, a Unicode code point; the only purpose is to dump the bytes of that uint32_t to memory. The byteswap should be taking place in a serialization view that produces a range of bytes and can either encode to little endian or big endian.

The little dance we force users to go through for, say, serializing std::float32_t is just not ergonomic: bit-cast the range to uint32_t, use a second view to transform the endianness, use a third view to generate the byte array, and a fourth view to join the bytes into a single range again. This sounds pretty terrible to me.
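
To illustrate what doing the conversion at the serialization step could look like, here is a minimal sketch. The names byte_order and to_bytes are hypothetical and do not come from the paper or the standard library; the sketch uses its own enum rather than std::endian only so that it compiles as C++17. The value stays a meaningful native integer the whole time, and the byte order only appears in the output bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for std::endian, used only in this sketch.
enum class byte_order { little, big };

// Hypothetical helper: write a 32-bit value as four bytes in the requested
// byte order. Only shifts and masks are used, so no intermediate
// byte-swapped integer ever exists.
template <byte_order Order>
constexpr std::array<std::byte, 4> to_bytes(std::uint32_t value) {
    std::array<std::byte, 4> out{};
    for (unsigned i = 0; i < 4; ++i) {
        // Big endian stores the most significant byte first.
        const unsigned shift = (Order == byte_order::big) ? 8u * (3u - i) : 8u * i;
        out[i] = static_cast<std::byte>((value >> shift) & 0xFFu);
    }
    return out;
}
```

For a std::float32_t, the only extra step would be a std::bit_cast to std::uint32_t before calling such a helper; a view built on top of it could then yield the bytes directly instead of chaining four adaptors.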

The examples in the paper are contrived, in order to obtain a nicer-looking before/after comparison table.

constexpr vector<uint32_t> utf16be_to_utf32be(
    const vector<uint16_t>& utf16be_data)
 
Why would I be holding a vector of big-endian uint16_t data in the first place? There is nothing I can do with this vector except transform its endianness so it becomes useful, or shoot myself in the foot by forgetting about the endianness of the data inside. I should either be holding a byte vector where the data is encoded in big-endian, or a vector of uint16_t or char16_t with native endianness.
  • If I started with a byte vector, it would be obvious in the comparison table that the paper's proposed feature is doing little to help the user.
  • If I started with native endian data, I wouldn't need the paper's feature at all.
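
To make the first bullet concrete, here is a rough sketch of the byte-vector starting point. The names load_be16 and utf16be_bytes_to_native are hypothetical, and error handling is an assumption of mine (a trailing odd byte is simply ignored). The result is a vector of native-endian code units that can be used directly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: combine two bytes, most significant first, into one
// native-endian UTF-16 code unit.
constexpr char16_t load_be16(std::byte hi, std::byte lo) {
    return static_cast<char16_t>((std::to_integer<unsigned>(hi) << 8) |
                                 std::to_integer<unsigned>(lo));
}

// Hypothetical helper: decode big-endian UTF-16 bytes into native code units.
// Assumption for this sketch: a trailing odd byte is silently dropped.
inline std::vector<char16_t> utf16be_bytes_to_native(
    const std::vector<std::byte>& bytes) {
    std::vector<char16_t> units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        units.push_back(load_be16(bytes[i], bytes[i + 1]));
    }
    return units;
}
```

Nothing here ever holds a uint16_t whose value is byte-swapped: the big-endian layout exists only in the byte vector.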