ISOCPP sg16 List: Re: Issues with Unicode in the standard

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sun, 21 Apr 2024 18:40:53 +0000

> What properties of a character type (as opposed to, say, a scoped or unscoped enum type) are important to you?
> enum class my_char32_t : std::uint32_t { };
> gets you all of the above except item 1.
> You can create your own type as a scoped (or unscoped) enum, if you want separate types for specific encodings.

Are you serious?
Then why wasn't char8_t defined as enum class char8_t : std::uint8_t { };?
Why did you have to add an extra type?
I suppose you also expect me to define my own operators to be able to do mychar >= 'a' && mychar <= 'z', or mychar + ('A' - 'a').
And how exactly do I write text like "The quick brown fox jumps over the lazy dog" with this character enumerator type?
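
To make the difference concrete, here is a minimal sketch. The helper names (is_lower_ascii, to_upper_ascii, pangram) are made up for illustration; my_char32_t is the enum from the quoted suggestion. It only shows that the built-in character type gets comparisons, arithmetic and literals for free, while the enum needs a cast (or hand-written operators) at every step.

```cpp
#include <cstdint>

// The scoped-enum "character" type from the quoted suggestion.
enum class my_char32_t : std::uint32_t {};

// With a real character type, range checks against literals work directly.
constexpr bool is_lower_ascii(char32_t c) {
    return c >= U'a' && c <= U'z';
}

// With the enum, the same check needs a cast first (or user-defined operators).
constexpr bool is_lower_ascii(my_char32_t c) {
    const auto v = static_cast<std::uint32_t>(c);
    return v >= U'a' && v <= U'z';
}

// Arithmetic on the character type: shift a lowercase ASCII letter to uppercase.
constexpr char32_t to_upper_ascii(char32_t c) {
    return is_lower_ascii(c) ? static_cast<char32_t>(c - (U'a' - U'A')) : c;
}

// A string literal exists for char32_t; there is no literal syntax that
// produces an array of my_char32_t.
constexpr const char32_t* pangram = U"The quick brown fox jumps over the lazy dog";
```

Nothing here is specific to Unicode: the same checks would be written the same way for any encoding that shares the ASCII range.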

No! There's char8_t, char16_t, and char32_t. That's the job they should be doing; trying to restrict them to Unicode is ridiculous.

>> And I want to be able to do more.
> What, exactly?

Whatever I want; it's nobody's business. I don't have to justify myself here.

> Given these surroundings, I'm not seeing how the incompatibility you seem to be worried about can arise. Could you please elaborate?

One compiler supports Unicode version X and the other supports Unicode version Y. If you use a character name that only exists in Y, it won't compile on the one that uses version X.
If Unicode issues an erratum that breaks strict compatibility, one compiler can have the erratum applied while the other does not, and they produce different code.
Unicode updates have nothing to do with C++, and this is what you get.

> \U + number is isomorphic to \N{some_name}, except you have to give a rather opaque number for the former.

No, they are not. Not even close.
If I type \u1234, the data loaded into memory has the exact value 0x1234 regardless of the encoding I decide to use for my string; it could even be invalid Unicode.
\N{some_name} requires a specific mapping from that specific string to a specific number, as defined by the Unicode standard.
If I want to use a different encoding that has the exact same character at a different code point, it doesn't work; worse, it intentionally misleads me.

\U can be used with any encoding, it doesn't matter; \N can only be Unicode.
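
A small sketch of what I mean, limited to char16_t and char32_t literals. The constant names are made up for illustration, and the \N part is guarded by the __cpp_named_character_escapes feature-test macro, so it is only compiled where the compiler advertises support for named escapes.

```cpp
// Numeric escape: the code unit stored is exactly the number that was written.
constexpr char16_t utf16_unit = u'\u1234';
constexpr char32_t utf32_unit = U'\u1234';

static_assert(utf16_unit == 0x1234, "numeric escape keeps its value in char16_t");
static_assert(utf32_unit == 0x1234, "numeric escape keeps its value in char32_t");

#if defined(__cpp_named_character_escapes)
// Named escape: the compiler must look the name up in its copy of the
// Unicode name tables to find the number.
constexpr char32_t named_unit = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
static_assert(named_unit == U'\u00E9', "name resolves to U+00E9 via the Unicode tables");
#endif
```

The first half depends only on the number I typed; the second half depends on which names the compiler's Unicode data knows about.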

This is the problem: requiring compilers to support the UTF-8 encoding, restricting character types strictly to Unicode usage,
restricting code points because of Unicode behavior, and making special concessions in core language features, like \N, that are specific to Unicode.
And this keeps going and going, little by little.

You may say that you want to allow for other encodings, and that you are not restricting anyone from using anything else, but the fact of the matter is that these features make UTF-8 the de facto encoding for C++.
It may not be intentional, but these "inconsequential features" make it so that the only correct way to write code is in UTF-8, and they also make UTF-8 the preferred way to manage text data in applications.
This didn't use to be a concern and now it is.

It may not be the case, but the practical effect, and what it feels like, is that SG16 is bullying developers into adopting Unicode.
And if I sound crass, it is because I don't like being bullied by someone telling me what code I should write.
The worst part is that this is a pattern, and we have a problem.

And I'm perfectly aware that it may be hard to convince you of this right now. But I hope to be able to wake somebody up.

Received on 2024-04-21 18:40:58