Let's add mandates for CHAR_BIT == 8 everywhere UTF related. UTF is simply not defined in other scenarios.
So if you're on a platform without 8-bit bytes we should also disable support like std::to_chars with char8_t, std::format with char8_t format strings, etc.? That seems a bit too far. And if you're saying it's totally fine for char8_t to not be 8-bit (its underlying type is unsigned char by the way) but not for char to be non-8-bit if someone wants UTF support, I don't see any coherence to the design. It seems like we would basically need to yeet all Unicode support across the standard library out the window on such platforms.
Unicode has been around for over 30 years.
In that time, no one has gone to the Unicode Consortium asking, "Look, I have a non-8-bit platform, I want to use Unicode, how is this supposed to work?". Unicode is very clear that UTF encodings are sequences of octets. It has never been an issue.
Even if you set aside the non-existence of toolchains, the set of platforms where you need to process text and the set of platforms that are not based on 8-bit bytes don't overlap.
I'm also not a fan of supporting weird char sizes, but if they are allowed in the language, we probably shouldn't make the Unicode support pay for it. I don't see much of a problem with having a few unused upper bits in a char8_t or char. Just because it has 64 bits doesn't mean you can't store UTF-8 code unit values inside.
Who is going to write the unit tests and run them? You mention upper bits, but why? There is no spec whatsoever.
We shouldn't make things up just to solve a problem that doesn't exist for customers we don't have.
Anyway, my greater issue with endian views is that they do the endianness conversion in the wrong place. You get a mathematically meaningless uint32_t value when performing a byteswap on, say, a Unicode code point; the only purpose is to dump the bytes of that uint32_t to memory. The byteswap should be taking place in a serialization view that produces a range of bytes and can either encode to little endian or big endian.
The little dance we force users to go through for, say, serializing std::float32_t is just not ergonomic: bit-cast the range to uint32_t, use a second view to transform the endianness, use a third view to generate the byte array, and a fourth view to join the bytes into a single range again. This sounds pretty terrible to me.
The examples in the paper are contrived, in order to obtain a nicer-looking before/after comparison table.
constexpr vector<uint32_t> utf16be_to_utf32be(
const vector<uint16_t>& utf16be_data)
Why would I be holding a vector of big-endian uint16_t data in the first place? There is nothing I can do with this vector except transform its endianness so it becomes useful, or shoot myself in the foot by forgetting about the endianness of the data inside.
I should either be holding a byte vector where the data is encoded in big-endian, or a vector of uint16_t or char16_t with native endianness.
- If I started with a byte vector, it would be obvious in the comparison table that the paper's proposed feature is doing little to help the user.
- If I started with native endian data, I wouldn't need the paper's feature at all.
I agree with that.
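To make the byte-vector alternative described above concrete, here is a minimal sketch of decoding big-endian UTF-16 bytes directly into native code units. The function name `decode_utf16be` is hypothetical and not from the paper, and this sketch simply ignores a trailing odd byte rather than reporting an error:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: interpret a byte buffer as UTF-16BE and return the
// code units as ordinary char16_t values. The endianness is handled once,
// at the boundary between bytes and values.
inline std::vector<char16_t> decode_utf16be(const std::vector<std::byte>& bytes) {
    std::vector<char16_t> units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto hi = std::to_integer<std::uint16_t>(bytes[i]);      // most significant byte first
        const auto lo = std::to_integer<std::uint16_t>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}
```

The caller never holds a container of wrongly ordered `uint16_t` values: it has either raw bytes or usable code units.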