ISOCPP sg16 List: Re: Agenda for the 2024-04-24 SG16 meeting

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sat, 20 Apr 2024 20:12:26 +0000

> There seem to be case-insensitive filesystems at least on Linux (for the better or worse), and those certainly need to be encoding-aware.
> Also, my understanding is that Windows has case-insensitive filesystems by default.

Do they need to be encoding-aware? Why? I put it to you that they don't.
Case-insensitive filesystems are a problem of a different nature.
In any case, you can only know for sure at runtime whether the filesystem you are working with is case-insensitive.
The only place where encoding comes into it is when you try to figure out whether two paths are the same and differ only by case.
That isn't helpful, because you only know whether the answer is accurate after you query the filesystem. A filesystem like NTFS, even though it can be case-insensitive, is not Unicode:
looking up the Unicode standard to map which code points are upper/lower-case versions of other code points would be incorrect; you would need to look up the NTFS rules to figure out what it considers to be the same.
Even if the designer of the filesystem made best efforts to follow Unicode, that can simply be broken by a Unicode update.
The only correct way to do it is to ask the operating system if they are indeed the same.
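
For illustration, a minimal sketch of "asking the operating system", assuming C++17 std::filesystem is available. The helper name same_file is made up for this example. std::filesystem::equivalent reports whether two existing paths resolve to the same file system entity, so on a case-insensitive volume two spellings of one file should compare equal without any case-mapping table in the program.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: true only if the OS reports that both paths resolve
// to the same file system entity. No case folding is done in user code.
bool same_file(const fs::path& a, const fs::path& b)
{
    std::error_code ec;
    const bool same = fs::equivalent(a, b, ec);
    // Any error (for example a path that does not exist) is treated as
    // "not the same" in this sketch.
    return !ec && same;
}
```

Note that this only answers the question for paths that already exist, which is the point: the answer comes from the filesystem, not from an encoding table.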

> Regardless of Unicode, you seem to agree that people should be able to express identifiers using non-English scripts.

Yes, but. As far as I'm concerned this is an optional feature: you can have it, but we shouldn't explicitly support it.
The encoding must prove itself fit for purpose, in this case for expressing C++ code; if the encoding can't support that, we shouldn't bend over backwards to accommodate it.
I don't advocate changing keywords like "if", "void", "int", or changing the function names in the standard library to use other scripts.
If your script doesn't work, that's too bad, learn English; you are never going to be able to get rid of that requirement.

> In C++, identifiers are separated by whitespace from adjacent identifiers. It seems very confusing to me if two identifiers are separated by visual whitespace, but instead this is interpreted as a single identifier.

And this is a uniquely human problem created by a bad script standard. The interpreter doesn't care that it looks like whitespace. Have your IDE highlight that as a problem, or have a code analysis tool pick it up.
Don't single out additional code points that look like spaces in Unicode because of this; the number of unprintable code points in Unicode is enormous, and the set will change whenever there's an update.
You will perpetually be playing whack-a-mole with this.
Let's not do it.

> Furthermore, there are also ideas to use mathematical operator symbols as C++ operators. If those can become part of regular identifiers, that avenue of future evolution is blocked.

Why? That's a terrible idea. Let's not do that. You can't even naturally type in those mathematical symbols with your keyboard.
Let it be blocked, it shouldn't be an avenue for future evolution. Keep it simple: if it is above 0x7F it's dead; the language shouldn't know or have to care what it is.
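
To make the "above 0x7F" rule concrete, here is a rough sketch of an identifier scanner that never decodes high bytes. Everything in it is an assumption for illustration: the function names are invented, the rule that any byte above 0x7F may appear anywhere in an identifier is one possible reading of the above, and the ASCII range checks assume an ASCII-compatible source encoding.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: an identifier may start with an ASCII letter, '_' or any byte
// above 0x7F. High bytes are never decoded, so the scanner does not know or
// care which encoding produced them.
constexpr bool is_ident_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c > 0x7F;
}

// After the first byte, ASCII digits are allowed as well.
constexpr bool is_ident_continue(unsigned char c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9');
}

// Hypothetical helper: returns the identifier starting at src[pos], or an
// empty view if no identifier starts there.
constexpr std::string_view scan_identifier(std::string_view src, std::size_t pos)
{
    if (pos >= src.size() || !is_ident_start(static_cast<unsigned char>(src[pos])))
        return {};
    std::size_t end = pos + 1;
    while (end < src.size() && is_ident_continue(static_cast<unsigned char>(src[end])))
        ++end;
    return src.substr(pos, end - pos);
}
```

With this rule the bytes of a UTF-8 zero-width joiner simply become part of the identifier; spotting them is left to the IDE or an analysis tool.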

> WG21 did not feel comfortable with having a long-term maintenance burden in that area and thus deferred the rule-making to the Unicode standard. Other rule sets are plausible, but I don't agree we can live with no rules at all for identifiers expressed in non-English scripts.

I agree with WG21, we shouldn't have that burden. I can live perfectly fine with no rules. Not everything needs to have a rule.
Naming functions in Latin script as "aaaaa" or "aa" or "awerhtiuhfg" is a terrible idea. The language does not disallow it, but it is a terrible idea, so I just don't do it. There doesn't need to be an explicitly enforced rule that stops me from doing it.
Let developers police themselves, trust that they don't do stuff that is bad; if they do, it's their problem. We don't need a script gestapo.

> So, how would you express the semantics of universal-character-names, then?

You don't. And you shouldn't. That's another terrible idea.
Why should I be forced to format with unicode? Why are you telling me what code I should write instead of helping me write code?
That defeats the goal of trying to be encoding agnostic, now you can't work with a different encoding. C++ shouldn't be the language of Unicode. This needs to stop.
It should allow the use of Unicode; it shouldn't require Unicode in order to be used.

> Can you give specific examples where you believe a plausible identifier is prevented by the Unicode identifier rules?

A zero-width-joiner with code point 0x200D which is suggested to be disallowed, could have a different meaning in a different encoding.

>> Adding Unicode support to the language should start by providing conversion functions that allows you to convert between different encodings, or that provide some checks or guarantees on runtime data that a user is free to not use if they want to use something else.
> Agreed on that part. So, do you have specific concerns about the proposals for providing exactly those conversion functions, or at least a subset thereof?

No, I agree, we should add those. We should absolutely make it easy for users to manipulate Unicode data to their heart's content, and we should provide that to users as a library.
And that is not as hard a problem, because there's nothing special about zero-width joiners or emoji when manipulating text, and a lot of the Unicode evolution isn't even relevant here.
Just don't use Unicode in the C++ lexer, and don't make it the mandatory way of formatting things either.
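
As a small illustration of conversion living in a library rather than in the lexer, here is a sketch of one such function, converting ISO/IEC 8859-1 (Latin-1) bytes to UTF-8. The name latin1_to_utf8 is made up for this example and is not taken from any proposal. It relies only on the fact that Latin-1 byte values equal the first 256 Unicode code points.

```cpp
#include <string>
#include <string_view>

// Hypothetical library function: converts Latin-1 bytes to UTF-8.
std::string latin1_to_utf8(std::string_view in)
{
    std::string out;
    out.reserve(in.size());
    for (const char ch : in) {
        const auto b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII range: one byte, unchanged.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

A user who works in some other encoding simply never calls it; nothing in the language itself has to know about either encoding.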

Received on 2024-04-20 20:12:31