
sg16


Re: Issues with Unicode in the standard

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sun, 21 Apr 2024 17:07:26 +0200
On 21/04/2024 09.44, Tiago Freire wrote:
>> Why do you want to use char8_t, char16_t, and char32_t for non-Unicode encodings?
>
> Because these are types with well-defined properties:
>
> 1. They are character types

What properties of a character type (as opposed to, say,
a scoped or unscoped enum type) are important to you?

> 2. They are distinct types, distinct from uint8_t/uint16_t/uint32_t, allowing for correct overloading.
>
> 3. They have portable predictable widths (wchar_t doesn’t)
>
> 4. They have predictable properties such as signedness (char doesn’t)
>
> No other types have this set of properties.

enum class my_char32_t : std::uint32_t { };

gets you all of the above except item 1.
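Spelled out as a minimal sketch (the name my_char32_t and the checks below are
purely illustrative, not anything the standard provides):

```cpp
#include <cstdint>
#include <type_traits>

// A distinct 32-bit "character-like" type without using char32_t.
enum class my_char32_t : std::uint32_t {};

// Item 2: distinct from std::uint32_t, so overloads can differ.
static_assert(!std::is_same_v<my_char32_t, std::uint32_t>);

// Item 3: portable, predictable width.
static_assert(sizeof(my_char32_t) == 4);

// Item 4: predictable signedness (the underlying type is unsigned).
static_assert(std::is_unsigned_v<std::underlying_type_t<my_char32_t>>);

// Item 1 is the one property it lacks: it is not a character type,
// so e.g. U"..." string literals do not produce arrays of it.
```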

> utf8 may have been a driving reason, but it’s a type.
>
> It’s char8_t, not char_utf8_t. What should I use if I want to manipulate CP437? Should we create a distinct type char_cp437_t?
>
> What if I want to create software for the Chinese market and be compatible with GB18030, should we create a distinct type char_GB18030_2022_t?

You can create your own type as a scoped (or unscoped) enum,
if you want separate types for specific encodings.

> I’m already using these types, not only for unicode, but also for things that are unicode-like (they’re almost directly convertible but not exactly; they are mostly printable as is, but you can’t use standard algorithms to convert them).

Nobody stops you. The one situation where the core language ties charN_t
to a particular encoding is in the interpretation of uN character and
string literals. But the compiler won't be able to divine your preferred
encoding for those anyway, so that's probably not a concern for you.
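A small illustration of that one tie-in (C++20; \u00e9 is U+00E9 LATIN SMALL
LETTER E WITH ACUTE):

```cpp
// The uN prefixes are the one place the core language fixes an
// encoding: u8/u/U literals are UTF-8/UTF-16/UTF-32 by definition,
// independent of the source file's encoding.
static_assert(sizeof(u8"\u00e9") == 3);    // two UTF-8 code units + NUL
static_assert(u'\u00e9' == 0x00E9);        // one UTF-16 code unit
```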

And if a particular standard library function doesn't suit your needs,
you're at liberty to simply ignore that function. Others, however,
may find that library function useful for their needs.

> And I want to be able to do more.

What, exactly?

> There’s absolutely no reason to specify in the standard that char8_t should encode utf8, because that’s a detail that doesn’t really matter until you try to convert it, either to a different format, an entry in a database, or to a pixel on screen.

The standard doesn't say that your program has undefined behavior
if you happen to store non-UTF-8 data in a char8_t. You're unlikely
to be able to use standard library functionality (that hopefully will
arrive, eventually) that takes a char8_t range and assumes it contains
UTF-8 data, but that's par for the course, in my view.

> I shouldn’t need to be asked the question “why” you would want to use it in this way. These things exist, I want to use it because it’s the only way to do it correctly, it’s a perfectly valid thing to do.
>
>
>
> We shouldn’t stick a flag on these types and say “No, hum hum, you used char8_t, you can only use it to encode utf8, nothing else”, the tyranny of unicode. Is this really the direction we want to go?

I would find software that uses char8_t for anything but UTF-8
questionable, but that's my personal opinion. Nothing in the
standard prevents you from using the char8_t type in your program
in any way you see fit.

>> How is support for named-universal-characters (\N{xxx}) problematic for you?
>
>
>
> It’s a terrible idea. Compilers now need to drag along a huge set of named unicode characters that needs to be updated with the unicode standard, guaranteeing that at some point code will become incompatible depending on which standard the vendor decided to support.

Unicode has a stability guarantee for character names. Even character names with
typos in them will be supported forever. New names will be added, but existing
names will never change or vanish. If you've written a program using names from
Unicode 15, it will continue to work even when your compiler uses the names
from Unicode 25.

Given these surroundings, I'm not seeing how the incompatibility you seem to
be worried about can arise. Could you please elaborate?

Regarding the size of the character names database: Vendors were consulted before
adding that requirement into the standard, and some clever compression techniques
were demonstrated that actually pare down the size quite a bit. For users,
some fixed table of at most a few MB in your compiler simply doesn't matter.

> Will you do this for CP437 or GB18030? Why wasn’t \U, \x enough for you?

\U + number is isomorphic to \N{some_name}, except you have to give a
rather opaque number for the former. Both options identify a particular
Unicode code point, which may or may not correspond to a character in the
encoding of your choice. Where is the difference?

> You can look up the character name, but you can’t look up its code?

Using a human-readable name in source code conveys at least some meaning
to the reader, even without any lookup in a secondary source. That's
generally not true for plain numbers.

> Should this be allowed even if the underlying text encoding is not even unicode? Why was this wart necessary?

What's the "underlying text encoding"?

Maybe your program is compiled on a platform with a different encoding
than where it will run eventually.

>> How does the P1949R7 identifier syntax restrict code you would like to write?
>
>> What do you believe restricts your use of your preferred source file encoding?
>
>
>
> Strictly speaking it doesn’t stop me *NOW*.

Great.

> But it will if and when a new encoding standard comes along.

Could you please outline some properties of such a hypothetical
future encoding standard that would make the current C++ approach
untenable?

> Having multiple file encodings is already a thing that exists. How do you envision this being implemented in practice?
>
> Should only files encoded in Unicode follow these rules? You want to make it so that code point 0x200D is perfectly allowed if the encoding is something other than unicode, but because it shows up in a unicode-encoded file you can’t have it?
>
> Why?
>
> Why would I want to stop people from using it? What do I gain by having to code more complex compilers to enforce these rules?

Compilers are written by very few people, compared to millions of C++
programmers out there in the world. We do listen to implementers,
but we are cognizant that some burden for a few does not outweigh
benefits for many.

And, as I said above, there is fairly little enforcement here, anyway.

> They are a bad idea to use; I, as an adult, can simply not use them without having the compiler police come knocking at my door.

I don't think the phrase "compiler police" is adequate in a technical discussion.

As was pointed out earlier, the encoding of your C++ source file is completely
abstracted away by [lex.phases] p1.1. You can choose any encoding you want for
this, and it has no impact on what \u1234 means in your program. None at all.
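To illustrate (C++20):

```cpp
// [lex.phases] p1.1: the source file's encoding is translated away
// before tokens exist, so a universal-character-name has one fixed
// meaning regardless of how the file itself was encoded.
static_assert(U'\u1234' == 0x1234);        // always U+1234
static_assert(sizeof(u8"\u1234") == 4);    // three UTF-8 code units + NUL
```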

> Here’s a better idea. Let’s not explicitly support unicode. Let’s not do anything.
>
> It saves me a lot of work,

What is your work? What exactly would you want to do differently?

> compilers are much easier to implement, and ship much smaller, I can read my text in unicode just fine, everyone is happy.
>
> Your issue is that it causes confusion in some cases when you use unicode? How about not using unicode? Why should I prevent the development of something better than unicode in the future because you want to restrict something that you shouldn’t be doing to begin with?

If and when an ostensibly better replacement for Unicode comes along,
I'm sure WG21 will be open to reconsider the rather few dependencies
on Unicode in the core language.

Jens

Received on 2024-04-21 15:07:36