Date: Sat, 20 Apr 2024 13:58:31 +0200
On 20/04/2024 12.33, Tiago Freire via SG16 wrote:
>
>
>> Getting a name from reflection:
>
>> We can't know how the string will be used, so it needs to follow the rules of C++: either it is a u8 string and is UTF-8 encoded, or it is a non-UTF string in the literal encoding (which might be EBCDIC, etc.). Only UTF-8 (or another Unicode encoding) can represent all identifiers.
>
>
>
> I completely disagree; using u8 should only indicate that the underlying character type is char8_t, not that there is a UTF-8 encoding of the underlying character sequence.
Well, that's not the status quo of the C++ standard.
See table [tab:lex.string.literal] in [lex.string].
If you feel that C++ is taking the wrong direction in this matter
(or any of the other matters you voice below), please write
a paper with rationale, directed at SG16.
I'm seeing a lot of venting below, but rather little rationale,
and also rather few constructive alternatives.
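(For concreteness, a minimal sketch of what that table means in practice. The example is mine, not part of the thread; the second count depends on the implementation's ordinary literal encoding, and the program is ill-formed if that encoding cannot represent U+00E9.)

#include <cstdio>

int main() {
    const char8_t u8str[] = u8"\u00E9"; // always the UTF-8 code units 0xC3 0xA9
    const char    str[]   = "\u00E9";   // encoded in the ordinary literal encoding
                                        // (UTF-8, Latin-1, ...: implementation-defined)
    std::printf("u8 literal:       %zu code units\n", sizeof(u8str) - 1); // always 2
    std::printf("ordinary literal: %zu code units\n", sizeof(str) - 1);   // encoding-dependent
}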
> I think the approach here is being made overly complicated.
>
> There's already an industry-standard way of identifying the encoding of a file, and that's by using a BOM. If the BOM is missing from the file, we shouldn't assume by default that the file is encoded in UTF-8.
>
> Not that I think this should have any relevance to the C++ language at all; it is the responsibility of the compiler to identify the encoding of the file, decide whether it wants to support it, and then translate it into something that can normatively be interpreted.
>
> Only after that point should the C++ standard have any relevance. If the file is incorrectly encoded, it is the compiler's responsibility to deal with the encoding and reject the file, and the C++ standard shouldn't play a role in this at all.
And that's what C++ says in [lex.phases] p1.1, which we took great
effort to phrase in a way to enable the model you're describing.
Do you have any particular concern with that wording?
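(Purely as an illustration of the model being described, and nothing that [lex.phases] mandates: a front end that wanted to sniff a UTF-8 BOM before translation could do something along these lines. The helper name is made up.)

#include <fstream>
#include <string>

// Returns true if the file starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char bom[3] = {};
    in.read(reinterpret_cast<char*>(bom), 3);
    return in.gcount() == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}

Whether to also accept BOM-less UTF-8, or to fall back to some other source encoding, is exactly the kind of policy that [lex.phases] p1.1 leaves to the implementation.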
> I'm going to be honest about my personal opinion about this: Unicode is a complete and utter mess, it's garbage as a standard, it fails at what it's supposed to do, and we shouldn't have to suffer because of the inanities of that "optional to use" standard.
There is nothing "optional" about the Unicode standard; it's a
normative reference for C++; see [intro.refs].
> If somebody else comes up with a better standard and people can code C++ with it right out of the gate, all the better.
And nothing in the C++ standard prevents anyone from doing that.
> I think it is far more productive to allow developers to use whatever encoding they want; we should be helping developers write code, not telling them what code to write. And hopefully they will do something better than what we have right now.
>
>
> Rant aside, that doesn't mean that Unicode cannot be supported. This is actually quite easy, and you see it everywhere.
>
> The way you do it is:
>
> *BY NOT EXPLICITLY SUPPORTING IT*
There is a desire to portably express code points (characters, if you want)
in non-English writing systems, without requiring such code points to be
encoded in the source file in something outside of the basic character set.
We need a way to identify such code points / characters, and the one
comprehensive catalog for writing systems is Unicode, so C++ (ever since
C++98) chose to use its numbering (and, more recently, its naming) for
identifying such characters portably.
I'm not aware of any other similarly comprehensive catalog. Do you
have an alternative in mind?
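For example (my own illustration, not from the original message), all of the following name the same character using nothing outside the basic character set:

const char8_t* a = u8"\u00E9";                              // by scalar value (UCNs, since C++98)
const char8_t* b = u8"\N{LATIN SMALL LETTER E WITH ACUTE}"; // by Unicode name (C++23 named escapes)
int caf\u00E9 = 0;                                          // UCNs are also valid in identifiers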
> Do you want to encode emoji in string literals? No problem: a valid multi-byte code point in UTF-8 does not contain any bytes with values at or below 0x7F, so we don't actually need to care;
>
> whatever bytes are there are what ends up in the binary. The fact that it is supposed to be UTF-8 is irrelevant, the standard shouldn't care; it's the compiler's responsibility to convert the code point back to its byte sequence.
>
>
>
> The file has an invalid UTF-8 byte sequence? That's the compiler's problem: reject the file because it is incorrectly encoded. The encoding of the file should be dealt with upfront by the reading algorithm; it shouldn't even get to the point where C++ interpretation matters.
>
>
>
> The file is actually encoded in UCS-4 and has code point 1533 in a place where there should be an 8-bit character? If it's a single character, then it's the compiler's responsibility to fail to compile, because it can't encode 1533 into one byte; the fact that the language is C++ is irrelevant here.
>
> Is it actually supposed to be in a string? We can define that such sequences should *PREFERENTIALLY* be converted to a UTF-8 sequence from whatever encoding, but it's the responsibility of the compiler that decides to support that source encoding to do the conversion correctly. If it can't do that, then maybe the compiler shouldn't support it.
You seem to be talking about [lex.phases] p1.1 a lot; I'm not seeing new information here.
Jens
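(One factual remark on the claim above, with a tiny sketch of my own: every byte of a multi-byte UTF-8 sequence has its high bit set, i.e. lead bytes are 0xC2..0xF4 and continuation bytes 0x80..0xBF, so none of them collide with basic-character-set bytes.)

#include <cassert>

int main() {
    const unsigned char emoji[] = { 0xF0, 0x9F, 0x92, 0xA9 }; // U+1F4A9 in UTF-8
    for (unsigned char byte : emoji)
        assert(byte >= 0x80); // never a basic-character-set byte
}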
> The source file is using an encoding that doesn't support a specific control character? There must be a replacement for it somehow; if the compiler supports that encoding, then it is responsible for providing a valid translation. If it can't do that, then tough! How about not supporting it? That's a perfectly valid option. Not a C++ problem.
>
> If you disagree, I can do something ridiculous like invent an encoding format that can only encode the characters '0' and '!'; good luck supporting that!
>
> (Someone just inventing an encoding and people deciding to use it is how all encodings came to be; there's nothing to say they are not broken in the same ridiculous way as exemplified, just with extra steps. Stop trying to support everything.)
>
>
>
> Do you want to avoid character look-alike attacks? That's exclusively a human problem; it is the job of the IDE to highlight those for the human. Visual Studio Code has that functionality, and it works quite well at identifying those for me; other IDEs should follow suit, and you can even build tools to flag those. Again, the C++ standard shouldn't care. If it goes through the compiler, good or bad, that's what you get; add a code analysis tool to stop that.
>
>
>
> Do you want to use weird combining marks whose bytes map to control characters in C++ but are not intended to be interpreted as control characters? Well, I guess you cannot do that!
>
> Are you guys really serious about writing identifiers with emoji? For real? How is this helpful? Come on, guys, what are we doing?
>
> I don't want to have to read emoji as code, EVER! It shouldn't be a thing. If you think otherwise, please go away and do something more productive; this is a waste of time. You are not fixing something, you are not helping me code, you are actually making things worse for me as a developer. Please stop that!
>
>
>
> C++ is English; keywords like 'if', 'else', and 'while' are English. Sure, there's nothing to stop people from writing function names and identifiers in Korean, Japanese, or other scripts; I'm all for it, power to them.
>
> Your script doesn’t work with C++? Tough!
>
> You want to make a fuss because you can't have a poop emoji as a type name? Get bent!
>
>
>
> What else is left that is even relevant to discuss about Unicode?
>
>
>
> With all of this I can still write algorithms and process text in UTF-8 or Unicode or whatever other encoding I want. I can still write code to process and handle all the intricacies of Unicode without surprises just fine. What do I need this for? Tell me, how does this help me write code?
>
>
>
>
>
> This is something that I have told many of my coworkers when dealing with design issues: far more important than the things you add are the things that you don't.
>
> How about we don’t do this?
>
>
>
> Just trim down the inconsistencies that currently exist to make things well defined, and let's not shoot ourselves in the foot by trying to force support for all the inane things of a broken standard.
>
> You really need to reconsider the priorities here.
>
>
>
>
Received on 2024-04-20 11:58:46