sg16

Re: Issues with Unicode in the standard

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sun, 21 Apr 2024 23:44:22 +0200
On 21/04/2024 20.40, Tiago Freire wrote:
>
>> What properties of a character type (as opposed to, say, a scoped or unscoped enum type) are important to you?
>> enum class my_char32_t : std::uint32_t { };
>> gets you all of the above except item 1.
>> You can create your own type as a scoped (or unscoped) enum, if you want separate types for specific encodings.
>
> Are you serious?
> Then why wasn't char8_t defined as enum class char8_t : std::uint8_t { };?

For one thing, std::uint8_t is not guaranteed to exist on all platforms.

> Why did you have to add an extra type?
> I suppose you also expect me to define my own operators to be able to do mychar >= 'a' && mychar <= 'z', or mychar + ('A' - 'a').

The first two operations don't make sense for a lot of encodings.
(What does it mean for a character to be "greater" than another one,
absent rather elaborate collation rules?)

For the last one, there is no requirement that an encoding have
consecutive numbers for consecutive letters, so this is also
questionable for encodings in general.

So, if you believe those operations make sense for a particular encoding
that you wish to represent as a type, go ahead and define those operations.
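
For illustration, a minimal sketch of what that could look like (the type
and function names here are made up for this mail, and it assumes C++17):

  #include <cstdint>

  // Hypothetical distinct character type for one particular encoding.
  enum class my_char32_t : std::uint32_t {};

  // Only meaningful because the chosen encoding happens to encode
  // 'a'..'z' and 'A'..'Z' contiguously (as ASCII / ISO 10646 do).
  constexpr bool is_ascii_lower(my_char32_t c)
  {
    auto v = static_cast<std::uint32_t>(c);
    return v >= U'a' && v <= U'z';
  }

  constexpr my_char32_t to_ascii_upper(my_char32_t c)
  {
    auto v = static_cast<std::uint32_t>(c);
    return is_ascii_lower(c) ? my_char32_t{v - (U'a' - U'A')} : c;
  }

  static_assert(to_ascii_upper(my_char32_t{U'q'}) == my_char32_t{U'Q'});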

If you, instead, prefer automatic conversions, use an unscoped enumeration,
which has implicit promotions to the underlying type:

  enum my_char32_t : std::uint32_t {};
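
With that, the comparisons and arithmetic from your earlier example compile
without any user-defined operators; a small self-contained sketch of mine
(again assuming C++17):

  #include <cstdint>

  enum my_char32_t : std::uint32_t {};   // unscoped: implicit conversions apply

  int main()
  {
    my_char32_t c{U'q'};
    bool lower = c >= U'a' && c <= U'z';       // promotes to the underlying type
    std::uint32_t upper = c - (U'a' - U'A');   // plain arithmetic, no casts
    return (lower && upper == U'Q') ? 0 : 1;
  }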

> And how do I exactly write text like "The quick brown fox jumps over the lazy dog" with this character enumerator type?

How do you write this text such that it is encoded in your preferred encoding at all?

You seem to be ignoring vital parts of my earlier e-mails where I try to explain
that the encoding of the source file is distinct from and unrelated to the encoding
used in the binary object files produced by the compiler. The latter, for an
ordinary string literal, is called the "ordinary literal encoding".
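
To make the distinction concrete, a small sketch of mine (assumes C++20 for
char8_t): the bytes of the ordinary literal follow whatever ordinary literal
encoding the compiler is configured with, while the u8 literal is always
UTF-8, no matter how the source file itself is encoded.

  #include <cstdio>

  int main()
  {
    const char    ordinary[] = "\u00C4";   // U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
    const char8_t utf8[]     = u8"\u00C4"; // always UTF-8: 0xC3 0x84

    // Array sizes include the terminating null character.
    std::printf("%zu %zu\n", sizeof ordinary, sizeof utf8);
    // Prints "2 3" with a Latin-1 ordinary literal encoding, "3 3" with UTF-8.
    return 0;
  }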

> No! There's char8_t, char16_t, and char32_t. That's the job they should be doing; trying to restrict this to Unicode is ridiculous.
>
>>> And I want to be able to do more.
>> What, exactly?
>
> Whatever I want; it's nobody's business. I don't have to justify myself here.

If you wish to cause changes to the C++ standard, it is generally helpful
to explain the entirety of your (technical) motivations for doing so.
Only then can others form an opinion whether the problems you see
are actual problems, and whether alternative approaches might exist.

>> Given these surroundings, I'm not seeing how the incompatibility you seem to be worried about can arise. Could you please elaborate?
>
> One compiler supports Unicode version X, the other supports Unicode version Y; you use a term that only exists in Y, and it won't compile on the one that uses version X.

We will guarantee support for Unicode 15.1 going forward.

In any case, stating "my program needs Unicode Y support in your compiler"
is not much different from stating "my program needs C++26 support in your
compiler". Either you have a compiler that offers that, or you can't
compile the program. That's the way it has always been when new features
are introduced. Alternatively, you can opt to restrict yourself to
the use of Unicode 15.1 names (which is already a quite comprehensive set)
and stay compatible with newer Unicode versions.
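
For example (a sketch of mine; named escapes are a C++23 feature, and BULLET
is just one such stable name):

  // Both denote U+2022; Unicode guarantees the name will never change.
  constexpr char32_t by_name  = U'\N{BULLET}';
  constexpr char32_t by_value = U'\u2022';
  static_assert(by_name == by_value);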

The worries come when a program that used to compile with an old
version of the compiler / Unicode no longer compiles (or silently
changes semantics) with a newer version. That's called a backward
compatibility break, but as I said, Unicode guarantees this doesn't
happen with character names.

> Unicode issues an errata breaking strict compatibility; one compiler can have the errata, the other does not, and they produce different code.

Not for character names; that would mean Unicode breaking its stability
promise. Do you have a specific example of that happening?

> Unicode updates have nothing to do with C++, and this is what you get.
>
>
>> \U + number is isomorphic to \N{some_name}, except you have to give a rather opaque number for the former.
>
> No, they are not. Not even close.
> I type \u1234, and the data loaded into memory has the exact value 1234 regardless of the encoding I decide to use for my string

This is a factually incorrect statement. This is not how universal-character-names
work in C++. Please read [lex.phases], [lex.charset], and [lex.string] in the C++ standard;
if you have any questions about the meaning of the normative text, please do not
hesitate to ask.

In particular, I can write "\u0041" in my source code (this is intended
to be a string-literal), and if the compiler is configured such that its
ordinary literal encoding is EBCDIC, the representation in memory will be
an "A" in EBCDIC encoding (which is 0xC1, if I read https://en.wikipedia.org/wiki/EBCDIC
correctly).
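
A sketch of how to observe that (the flag name is GCC-specific and only an
example of how a compiler might be configured; IBM1047 is one EBCDIC code page):

  #include <cstdio>

  int main()
  {
    // "\u0041" designates U+0041 LATIN CAPITAL LETTER A. The byte stored
    // follows the ordinary literal encoding, e.g. chosen on GCC with
    // -fexec-charset=IBM1047 (EBCDIC) or -fexec-charset=UTF-8.
    const char text[] = "\u0041";
    std::printf("%02X\n", static_cast<unsigned char>(text[0]));
    // Prints 41 under ASCII/UTF-8, C1 under EBCDIC.
    return 0;
  }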

> , it could even be invalid Unicode.

It can't. See [lex.charset] p4:

"The program is ill-formed if that number is not a Unicode scalar value."

> \N{some_name} requires a specific mapping from that specific string to a specific number as defined by the unicode standard.
> I want to use a different encoding that has the exact same character but at a different code point,

And that works perfectly fine if you tell your compiler to use the
desired ordinary literal encoding.

> Nope, doesn't work, plus it intentionally misleads me.
>
> \U can be any encoding, doesn't matter; \N can only be Unicode.

> And I'm perfectly aware that this may be hard to convince you right now. But I hope to be able to wake somebody up.

Convincing me works better if we can first agree on the status quo
of the C++ standard, as far as it affects your concerns.

Jens

Received on 2024-04-21 21:44:34