
Re: Agenda for the 2024-04-24 SG16 meeting

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sat, 20 Apr 2024 10:33:59 +0000

> Getting a name from reflection:
> We can't know how the string will be used, so it needs to follow the rules of C++: Either it is a u8 string, and is utf-8 encoding, or it is a non-utf string in the literal encoding (might be ebcdic, etc). Only utf-8 (or another unicode encoding) can represent all identifiers.

I completely disagree. Using u8 should only indicate that the underlying character type is char8_t, not that the underlying character sequence is UTF-8 encoded.


I have been following this discussion and I have to express my frustration and disappointment with the whole thing.
I think the following needs to be said.

I think the approach here is overly complicated.
There is already an industry-standard way of identifying the encoding of a file: a BOM (a sketch follows below). If the BOM is missing from the file, we shouldn't assume by default that the file is encoded in UTF-8.
Not that I think this should be relevant to the C++ language at all. It is the responsibility of the compiler to identify the encoding of the file, decide whether it wants to support it, and then translate it to something that can normatively be interpreted;
only after that point should the C++ standard have any relevance. If the file is incorrectly encoded, it is the compiler's responsibility to deal with the encoding and reject the file, and the C++ standard shouldn't play a role in this at all.
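
Just to make concrete what I mean by dealing with this up front, here is a rough sketch of a BOM check a file reader could do before any C++ interpretation happens; the detect_bom name and the choice of BOMs checked (UTF-8 and UTF-16 only) are just my illustration, nothing normative:

#include <cstddef>
#include <string_view>

// Rough sketch: peek at the first bytes of a source file and report which
// BOM, if any, is present. A file without a BOM is left for the compiler to
// decide on by other means (command-line flag, platform default, etc.).
enum class Bom { none, utf8, utf16_le, utf16_be };

Bom detect_bom(std::string_view first_bytes)
{
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(first_bytes[i]); };
    if (first_bytes.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::utf8;
    if (first_bytes.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::utf16_le;
    if (first_bytes.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::utf16_be;
    return Bom::none;
}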

I'm going to be honest about my personal opinion on this: Unicode is a complete and utter mess. It's garbage as a standard, it fails at what it is supposed to do, and we shouldn't have to suffer because of the inanities of that "optional to use" standard.
I personally would like to see it superseded by something sensible. I loathe it and I hope it dies.
I would like to point out that we are only having this discussion because of stupid ideas that got into Unicode. Were it sensible, we would not be here.
We shouldn't be giving special treatment to Unicode just because it is popular, nor should we rest on our laurels thinking that this problem is done and solved.
If somebody comes up with a better standard and people can code C++ in it right out of the gate, so much the better.

I think it is far more productive to allow developers to use whatever encoding they want. We should be helping developers write code, not telling them what code to write. And hopefully they will do something better than what we have right now.


Having had that little rant, that doesn't mean Unicode cannot be supported. This is actually quite easy, and you see it everywhere.
The way you do it is:
BY NOT EXPLICITLY SUPPORTING IT

This is not my first rodeo with Unicode; to me this is a solved problem. This approach works, and it works exceedingly well.
The answer is not more Unicode, it's less.

All control characters in C++ should be under 0x7F; UTF-8 here is irrelevant.
Do you want an escape sequence like \u2057 to add a multi-byte UTF-8 encoding into an 8-bit character type string? Sure, we can support that; there's a well-defined way to do it (see the sketch below). Weird corner cases don't even apply here. That's the only place where UTF-8 should even be relevant.
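
To show the well-defined way I'm talking about, here is a rough sketch of the UTF-8 encoding step a compiler could use to splice a \u2057 escape into an 8-bit string; the encode_utf8 name is mine and validation (surrogates, out-of-range values) is left out:

#include <string>

// Sketch: turn a Unicode scalar value (e.g. the 0x2057 from a \u2057 escape)
// into its UTF-8 byte sequence so it can be appended to an ordinary 8-bit
// string. Plain UTF-8 layout, nothing compiler-specific.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;  // encode_utf8(0x2057) yields the bytes E2 81 97
}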

Do you want to encode emoji in string literals? No problem. A valid multi-byte code point in UTF-8 contains no bytes below 0x80, so we don't actually need to care (see below):
whatever bytes are there are what ends up in the binary. The fact that it is supposed to be UTF-8 is irrelevant, the standard shouldn't care; it's the compiler's responsibility to convert the code point back to its byte sequence.
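
As a small illustration of that pass-through (assuming the literal encoding is UTF-8): every byte of the multi-byte sequence for an emoji is at or above 0x80, so nothing in it can collide with ASCII syntax.

#include <cassert>
#include <string_view>

int main()
{
    // U+1F642 spelled out as its UTF-8 bytes in an ordinary narrow literal.
    std::string_view smiley = "\xF0\x9F\x99\x82";
    for (unsigned char b : smiley)
        assert(b >= 0x80);  // no byte of the multi-byte sequence is ASCII
}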

The file has an invalid UTF-8 byte sequence? It's the compiler's job to reject the file because it is incorrectly encoded. The encoding of the file should be dealt with up front by the reading algorithm; it shouldn't even get to the point where C++ interpretation matters.
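
A deliberately simplified sketch of that upfront check (it only verifies lead/continuation structure, ignoring overlong forms and surrogates, and the function name is mine):

#include <cstddef>
#include <string_view>

// Walk the bytes once and reject the file before any C++-level
// interpretation if a sequence is structurally malformed.
bool looks_like_valid_utf8(std::string_view bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = lead < 0x80             ? 1
                          : (lead & 0xE0) == 0xC0 ? 2
                          : (lead & 0xF0) == 0xE0 ? 3
                          : (lead & 0xF8) == 0xF0 ? 4
                                                  : 0;
        if (len == 0 || i + len > bytes.size())
            return false;
        for (std::size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(bytes[i + j]) & 0xC0) != 0x80)
                return false;
        i += len;
    }
    return true;
}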

The file is actually encoded in UCS-4 and has code point 1533 where there should be an 8-bit character? If it's a single character, then it's the compiler's responsibility to fail to compile, because it can't encode 1533 into one byte; the fact that the language is C++ is irrelevant here (sketch below).
Is it actually supposed to be in a string? We can define that such sequences should PREFERENTIALLY be converted from whatever encoding to a UTF-8 sequence, but it's the responsibility of the compiler that decides to support that source encoding to do the conversion correctly. If it can't do that, then maybe the compiler shouldn't support it.
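
A minimal sketch of that compiler-side decision (names mine): a code point headed for a single 8-bit character either fits in one byte or translation fails, and in a string it could instead be re-encoded with a UTF-8 encoder like the one sketched above.

#include <stdexcept>

// Sketch: a single 8-bit character can only hold values up to 0xFF, so a
// code point like 1533 has nowhere to go and the compiler should just fail.
char narrow_char_or_fail(char32_t cp)
{
    if (cp > 0xFF)
        throw std::invalid_argument("code point does not fit in one byte");
    return static_cast<char>(cp);
}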

The source file is using an encoding that doesn't support a specific control character? There must be a replacement for it somehow; if the compiler supports that encoding, then it is responsible for providing a valid translation. It can't do that? Then tough! How about not supporting it? That's a perfectly valid option. Not a C++ problem.
If you disagree, I can do something ridiculous like inventing an encoding format that can only encode the characters '0' and '!'. Good luck supporting that!
(Someone just inventing an encoding and people deciding to use it is how all encodings came to be. There's nothing to say they aren't broken in the same ridiculous way as the example, just with extra steps. Stop trying to support everything.)

Do you want to avoid character lookalike attacks? That's exclusively a human problem; it is the job of the IDE to highlight those for the human. Visual Studio Code has that functionality, and it works quite well at identifying those for me. Other IDEs should follow suit, and you can even make tools to flag those (a sketch follows below). Again, the C++ standard shouldn't care. If it goes through the compiler, good or bad, that's what you get; add a code analysis tool to stop that.
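
As a sketch of the kind of standalone tool I mean (not real confusable detection, which would consult the Unicode confusables data; just a coarse first pass with names of my own choosing):

#include <iostream>
#include <string>
#include <vector>

// Flag any identifier containing bytes outside the ASCII range so a human,
// or a richer confusables table, can take a closer look at it.
void flag_suspicious_identifiers(const std::vector<std::string>& identifiers)
{
    for (const std::string& id : identifiers) {
        for (unsigned char b : id) {
            if (b >= 0x80) {
                std::cout << "review identifier for lookalikes: " << id << '\n';
                break;
            }
        }
    }
}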

Do you want to use weird combining marks whose bytes map to control characters in C++ but are not intended to be interpreted as control characters? Well, I guess you cannot do that!
Are you really serious about writing identifiers with emoji? For real? How is this helpful? Come on, guys, what are we doing?
I don't want to have to read emoji as code, EVER! It shouldn't be a thing. If you think otherwise, please go away and do something more productive; this is a waste of time. You are not fixing something, you are not helping me code, you are actually making things worse for me as a developer. Please stop that!

C++ is English; keywords like 'if', 'else', and 'while' are English. Sure, there's nothing to stop people from writing function names and identifiers in Korean, Japanese, or other scripts. I'm all for it, power to them.
Your script doesn’t work with C++? Tough!
You want to make a fuss because you can't have the poop emoji as a type name? Get bent!

What else is there left to discuss about Unicode that is even relevant?

With all of this I can still write algorithms and process text in UTF-8 or Unicode or whatever other encoding I want. I can still write code to handle all the intricacies of Unicode without surprises just fine. What do I need this for? Tell me, how does this help me write code?


This is something I have told many of my coworkers when dealing with design issues: far more important than the things you add are the things you don't.
How about we don’t do this?

Just trim down the inconsistencies that currently exist to make things well defined, and let's not shoot ourselves in the foot by trying to force support for all the inane things in a broken standard.
You really need to reconsider the priorities here.

Received on 2024-04-20 10:34:06