Re: Agenda for the 2024-04-24 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 21 Apr 2024 00:25:58 -0400
Thank you for sharing your thoughts, Tiago.

Please note that this mailing list is intended for technical
discussions. As such, I expect thoughtful rationale and evidence to
accompany directional assertions and critiques of others' work.

Members of this mailing list are unlikely to be persuaded by subjective
statements that such-and-such is an "utter mess" or "garbage", but you
might find agreement by pointing out specific areas that you think are
deficient or in need of improvement. Please take a few minutes to
re-read what you sent from the perspective of a recipient. Would you
find it persuasive? Would it lead you to believe that the sender is a
well-informed expert on the subject matter? Please strive to contribute
more constructive commentary. Any further messages that you send that I
determine, at my discretion, to contain non-constructive,
non-technical commentary will result in your future posts being moderated.

I'll try to respond to some of the items you mentioned below.

On 4/20/24 6:33 AM, Tiago Freire wrote:
>
> > Getting a name from reflection:
> >
> > We can't know how the string will be used, so it needs to follow
> > the rules of C++: either it is a u8 string and is UTF-8 encoded, or
> > it is a non-UTF string in the literal encoding (might be EBCDIC, etc.).
> > Only UTF-8 (or another Unicode encoding) can represent all identifiers.
>
> I completely disagree; using u8 should only indicate that the
> underlying character type is char8_t, not that the underlying
> character sequence has a UTF-8 encoding.
>
char8_t was specifically and explicitly introduced to support UTF-8. See
P0482R6 (char8_t: A type for UTF-8 characters and strings (Revision 6))
<https://wg21.link/p0482r6>. It is true that a sequence of char8_t
elements might not hold a well-formed sequence of UTF-8 code units.
However, there is no question of what encoding to use with a sequence of
char8_t; it is unambiguously always UTF-8.
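
As a concrete illustration (a minimal C++20 sketch; the character
U+00E9 is just an arbitrary example), the following compiles precisely
because u8 literals are specified to hold UTF-8 code units:

    // u8 string literals always have element type char8_t and always
    // hold UTF-8 code units, regardless of the ordinary literal encoding.
    static_assert(sizeof(u8"\u00E9") == 3);        // 0xC3 0xA9 plus the null terminator
    static_assert(u8"\u00E9"[0] == char8_t(0xC3)); // UTF-8 encoding of U+00E9
    static_assert(u8"\u00E9"[1] == char8_t(0xA9));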
>
> I have been seeing this discussion and I have to express my
> frustration and disappointment with the whole thing.
>
> I think the following needs to be said.
>
> I think the approach here is being overly complicated.
>
> There’s already an industry-standard way of identifying the encoding
> of a file, and that’s by using a BOM. If the BOM is missing from the
> file, we shouldn’t assume by default that the file is encoded in UTF-8.
>
> Not that I think this should have any relevance to the C++ language
> at all; it is the responsibility of the compiler to identify the
> encoding of the file, decide whether it wants to support it, and then
> translate it to something that can be normatively interpreted;
>
> only after that point should the C++ standard have any relevance. If
> the file is incorrectly encoded, it is the compiler’s responsibility
> to deal with the encoding and reject the file, and the C++ standard
> shouldn’t play a role in this at all.
>
Based on these statements and many of the paragraphs that follow, I
believe you have a misunderstanding regarding the encoding model used by
the C++ Standard and by existing compilers.

First, the reflection proposal and the topics up for discussion here
have nothing to do with the encoding of a source file.

The encoding model used by the C++ standard and by existing
implementations is that a source file has an associated encoding that is
used to derive the sequence of encoded characters that constitute the
input to the lexer. The (decoded) characters that are present within
character and string literals are transcoded from the source file
encoding to the associated encoding of the string literal type (the
ordinary literal encoding, UTF-8 for u8-prefixed literals, etc.). I
don't think this process is well understood by most programmers. This
model is used by virtually all programming languages and compilers.
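
To make that two-step model concrete, here is a small sketch (the
specific character is arbitrary, and the byte values mentioned for the
ordinary literal encoding are only examples of what an implementation
might choose, assuming it can represent the character at all):

    // Assume the source file is decoded correctly, so the lexer sees U+00E9.
    const char    narrow[] =   "\u00E9"; // transcoded to the ordinary literal
                                         // encoding: e.g. the single byte 0xE9
                                         // under Latin-1 or Windows-1252, or the
                                         // bytes 0xC3 0xA9 under UTF-8.
    const char8_t utf8[]   = u8"\u00E9"; // always the UTF-8 code units 0xC3 0xA9.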

The C++ Standard explicitly states, in translation phase 1
([lex.phases]p1.1 <http://eel.is/c++draft/lex.phases#1.1>), that
implementations may support "any other kind of input file". Perhaps that
wording could be improved, but the intent is to give implementations
permission to support whatever source file encodings they deem useful.
The C++ Standard does not place any restrictions on this. Existing
implementations accept a wide variety of source file encodings including
numerous ones that derive from the ASCII and EBCDIC encodings. This
support is intentional.
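
One way to see why that latitude is harmless (again a small sketch of
my own, not tied to any particular implementation): once the source
file has been decoded, the meaning of the program depends only on the
resulting characters, so an all-ASCII spelling via a
universal-character-name and a direct spelling in any accepted source
encoding denote exactly the same literal:

    // This line uses only basic source characters, yet it yields the same
    // array contents as writing the character directly in a source file
    // encoded as UTF-8, Latin-1, or an EBCDIC-derived code page that can
    // represent U+00E9.
    const char8_t e_acute[] = u8"\u00E9";   // always { 0xC3, 0xA9, 0x00 }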

I'm not going to reply to any of your other statements below, for two
reasons: 1) many of them are not helpful, not constructive, not
respectful of other members of the community, and not correct (for
example, the changes made in P1949R7 <https://wg21.link/p1949r7> removed
the existing allowances for emoji in identifiers; there is no effort to
add such support); 2) per the above, many of them appear to be based on
a misconception of how encodings are handled in C++ (and in programming
languages in general).

I am happy to answer any questions you might have about the encoding model.

Tom.

> I’m going to be honest about my personal opinion on this: Unicode
> is a complete and utter mess, it’s garbage as a standard, it fails at
> what it’s supposed to do, and we shouldn’t have to suffer because of
> the inanities of that “optional to use” standard.
>
> And I personally would like to see it be superseded by something else
> that is sensible. I loathe it and I hope it dies.
>
> I would like to point out that we are only having this discussion
> because of stupid ideas that got into Unicode. Were it sensible, we
> would not be here.
>
> We shouldn’t be giving special treatment to Unicode just because it is
> popular, nor should we rest on our laurels thinking that this problem
> is done and solved.
>
> If somebody else comes up with a better standard and they can code
> C++ using it right out of the gate, so much the better.
>
> I think it is far more productive to allow developers to use whatever
> encoding that they want, we should be helping developers to write
> code, not tell them what code to write. And hopefully they will do
> something better than what we have right now.
>
> Having had that little rant, that doesn’t mean that Unicode cannot be
> supported. This is actually quite easy, and you actually see it
> everywhere.
>
> The way you do it is:
>
> *BY NOT EXPLICITLY SUPPORTING IT*
>
> This is not my first rodeo with Unicode; to me this is a solved
> problem. This approach works, and it works exceedingly well.
>
> The answer is not more Unicode, it’s less.
>
> All control characters in C++ should be under 0x7F; UTF-8 is
> irrelevant here.
>
> Do you want an escape sequence like \u2057 to add a multi-byte
> UTF-8 encoding into an 8-bit character type string? Sure, we can
> support that; there’s a well-defined way to do that. Weird corner
> cases don’t even apply here. That’s the only place where UTF-8 should
> even be relevant.
>
> Do you want to encode emoji in string literals? No problem. A valid
> multi-byte code point in UTF-8 does not have bytes with values below
> 0x7F, so we don’t actually need to care;
>
> whatever bytes are there are what end up in the binary. The fact that
> it is supposed to be UTF-8 is irrelevant, the standard shouldn’t care;
> it’s the compiler’s responsibility to convert the code point back to
> its byte sequence.
>
> The file has an invalid UTF-8 byte sequence? That’s the compiler’s
> problem: it should reject the file because it is incorrectly encoded.
> The encoding of the file should be dealt with upfront by the reading
> algorithm; it shouldn’t even get to the point where C++ interpretation
> matters.
>
> The file is actually encoded in UCS-4 and has code point 1533 in a
> place where there should be an 8-bit character? If it’s a single
> character, then it’s the compiler’s responsibility to fail to compile,
> because it can’t encode 1533 into one byte; the fact that the language
> is C++ is irrelevant here.
>
> Is it actually supposed to be in a string? We can define that such
> sequences should *PREFERENTIALLY* be converted to a UTF-8 sequence
> from whatever encoding, but it’s the responsibility of the compiler
> that decides to support that source encoding to do the conversion
> correctly. If it can’t do that, then maybe the compiler shouldn’t
> support it.
>
> The source file is using an encoding that doesn’t support a specific
> control character? There must be a replacement for it somehow; if the
> compiler supports the encoding, then it is responsible for providing a
> valid translation. If it can’t do that, then tough! How about not
> supporting it? That’s a perfectly valid option. Not a C++ problem.
>
> If you disagree, I can do something ridiculous and invent an encoding
> format that can only encode the characters ‘0’ and ‘!’; good luck
> supporting that!
>
> (Someone just inventing an encoding and people deciding to use it is
> how all encodings came to be; there’s nothing to say that they are not
> broken in the same ridiculous way as exemplified, just with extra
> steps. Stop trying to support everything.)
>
> Do you want to avoid character look-alike attacks? That’s exclusively
> a human problem; it is the job of the IDE to highlight those for the
> human. Visual Studio Code has that functionality, and it works quite
> well at identifying those for me; other IDEs should follow suit, and
> you can even make tools to flag those. Again, the C++ standard
> shouldn’t care. If it goes through the compiler, good or bad, that’s
> what you get; add a code analysis tool to stop that.
>
> Do you want to use weird combining marks that use bytes that map to
> control points in C++ but are not intended to be interpreted as
> control characters? Well, I guess you cannot do that!
>
> Are you guys really serious about writing identifiers with emoji? For
> real? How is this helpful? Come on, guys, what are we doing?
>
> I don’t want to have to read emoji as code, EVER! It shouldn’t be a
> thing. If you think otherwise, please go away and do something more
> productive; this is a waste of time. You are not fixing something, you
> are not helping me code, you are actually making things worse for me
> as a developer. Please stop that!
>
> C++ is English; keywords like ‘if’, ‘else’, and ‘while’ are English.
> Sure, there’s nothing to stop people from writing function names and
> identifiers in Korean, Japanese, or other scripts; I’m all for it,
> power to them.
>
> Your script doesn’t work with C++? Tough!
>
> You want to make a fuss because you can’t have a poop emoji as a type
> name? Get bent!
>
> What else is there left that is even relevant to discuss about Unicode?
>
> With all of this I can still write algorithms and process text in
> UTF-8 or Unicode or whatever other encoding I want. I can still write
> code to process and handle all the intricacies of Unicode without
> surprises just fine. What do I need this for? Tell me, how does this
> help me write code?
>
> This is something that I have told many of my coworkers when dealing
> with design issues: far more important than the things you add are
> the things that you don’t.
>
> How about we don’t do this?
>
> Just trim down the inconsistencies that currently exist to make
> things well defined, and let’s not shoot ourselves in the foot by
> trying to force support for all the inane things of a broken standard.
>
> You really need to reconsider the priorities here.
>

Received on 2024-04-21 04:26:07