Issues with Unicode in the standard

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sun, 21 Apr 2024 07:44:55 +0000
I have created a new thread in order to not hijack the previous one.
Let’s make it a separate discussion.

Answering Tom’s question

> Why do you want to use char8_t, char16_t, and char32_t for non-Unicode encodings?
Because these are types with well-defined properties:
1. They are character types
2. They are distinct types, distinct from uint8_t/uint16_t/uint32_t, allowing for correct overloading (see the sketch after this list).
3. They have portable, predictable widths (wchar_t doesn’t)
4. They have predictable properties such as signedness (char doesn’t)
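To illustrate point 2, here is a minimal sketch (assuming a C++20 compiler; the print overloads are hypothetical, not anything from the standard library) of char8_t selecting a different overload than uint8_t:

    #include <cstdint>
    #include <iostream>

    // char8_t is a distinct type, so it selects a different overload than
    // std::uint8_t (which is typically an alias for unsigned char). No such
    // distinction is possible when both sides collapse to the same alias.
    void print(std::uint8_t) { std::cout << "integer byte\n"; }
    void print(char8_t)      { std::cout << "character code unit\n"; }

    int main() {
        print(std::uint8_t{65}); // selects the integer overload
        print(u8'A');            // selects the char8_t overload
    }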

No other types have this set of properties. UTF-8 may have been a driving reason for introducing them, but char8_t is just a type.
It’s char8_t, not char_utf8_t. What should I use if I want to manipulate CP437? Should we create a distinct type char_cp437_t?
What if I want to create software for the Chinese market and be compatible with GB18030? Should we create a distinct type char_GB18030_2022_t?
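As a concrete example, here is a sketch (C++20; the function name and the partial mapping table are mine, purely illustrative) of char8_t carrying CP437 code units, where the encoding is a property of the data rather than of the type:

    // char8_t code units interpreted as CP437. Only a few of the 256
    // mapping entries are shown; the rest fall through to U+FFFD.
    constexpr char32_t cp437_to_unicode(char8_t cu) {
        if (cu < 0x80) return static_cast<char32_t>(cu); // ASCII subset matches
        switch (cu) {
            case 0x80: return U'\u00C7'; // Ç
            case 0x81: return U'\u00FC'; // ü
            case 0xB0: return U'\u2591'; // ░ (light shade)
            default:   return U'\uFFFD'; // remainder of the table elided
        }
    }
    static_assert(cp437_to_unicode(u8'A') == U'A');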

I’m already using these types, not only for Unicode, but also for things that are Unicode-like (almost directly convertible, but not exactly; mostly printable as-is, but the standard algorithms can’t convert them).
And I want to be able to do more.

There’s absolutely no reason for the standard to specify that char8_t must encode UTF-8, because that’s a detail that doesn’t really matter until you try to convert it, whether to a different format, an entry in a database, or a pixel on screen.
I shouldn’t need to be asked why I would want to use it this way. These encodings exist; I want to use these types because they are the only way to handle them correctly, and it’s a perfectly valid thing to do.

We shouldn’t stick a flag on these types and say “No, hum hum, you used char8_t, you can only use it to encode UTF-8, nothing else”. That is the tyranny of Unicode. Is this really the direction we want to go?


> How is support for named-universal-characters (\N{xxx}) problematic for you?

It’s a terrible idea. Compilers now need to drag along a huge database of Unicode character names that must be updated in step with the Unicode standard, guaranteeing that at some point code will become incompatible depending on which version the vendor decided to support.
Will you do this for CP437 or GB18030? Why weren’t \U and \x enough? You can look up a character’s name, but you can’t look up its code? Should this be allowed even if the underlying text encoding is not Unicode at all? Why was this wart necessary?
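For reference, this is what the feature looks like (a minimal sketch assuming a C++23 compiler): both declarations denote U+200D, but only the first one forces the compiler to carry the Unicode name database:

    // C++23 named universal character escape vs. the numeric escape.
    constexpr char32_t zwj_by_name = U'\N{ZERO WIDTH JOINER}';
    constexpr char32_t zwj_by_code = U'\u200D';
    static_assert(zwj_by_name == zwj_by_code);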


> How does the P1949R7 identifier syntax restrict code you would like to write?
> What do you believe restricts your use of your preferred source file encoding?

Strictly speaking, it doesn’t stop me NOW. But it will if and when a new encoding standard comes along.
Having multiple source file encodings is already a reality. How do you envision this being implemented in practice?
Should only files encoded in Unicode follow these rules? You want to make it so that code point 0x200D is perfectly allowed if the encoding is something other than Unicode, but if it shows up in a Unicode-encoded file you can’t have it?
Why?
Why would I want to stop people from using it? What do I gain from having to build more complex compilers to enforce these rules?
These characters may be a bad idea to use, but I, as an adult, can simply not use them without having the compiler police come knocking at my door.
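For concreteness, a sketch of what the P1949R7 rules amount to (the identifier name is hypothetical): identifiers must match XID_Start followed by XID_Continue*, and U+200D has neither property, so it is rejected in identifiers while remaining fine in string literals:

    // Under P1949R7 (UAX #31 identifier syntax), U+200D (ZERO WIDTH JOINER)
    // has neither XID_Start nor XID_Continue, so it cannot appear in a name:
    // int zero\u200Dwidth = 0;             // ill-formed under P1949R7
    const char32_t* s = U"zero\u200Dwidth"; // still fine in a string literal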

Here’s a better idea. Let’s not explicitly support Unicode. Let’s not do anything.
It saves me a lot of work, compilers are much easier to implement and ship much smaller, I can read my Unicode text just fine, and everyone is happy.

Your issue is that it causes confusion in some cases when you use Unicode? How about not using Unicode? Why should developing something better than Unicode in the future be prevented because you want to restrict something that you shouldn’t be doing to begin with?

That’s my point: doing nothing is better than doing anything.


Received on 2024-04-21 07:44:59