
sg16

Re: Agenda for the 2024-04-24 SG16 meeting

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sat, 20 Apr 2024 14:30:42 +0000

> Well, that's not the status quo of the C++ standard.
> See table [tab:lex.string.literal] in [lex.string].
> If you feel that C++ is taking the wrong direction in this matter (or any of the other matters you voice below), please write a paper with rationale, directed at SG16.

Sure, I can write a paper. And yes, I think the direction is a problem.
I want to use char8_t/char16_t/char32_t for encodings other than Unicode.
The fact that the table explicitly lists UTF-8/UTF-16/UTF-32 is a problem, particularly since you manipulate these types at the byte/word level without having to conform to any specification (Unicode included).
Having filesystem::path specify explicit UTF-8/UTF-16 encodings, and mandate a conversion that can differ between the char/wchar_t and char8_t/char16_t types, is a problem: no filesystem is Unicode, and the standard doesn't even specify which algorithm performs the conversion.
The existence of functions like “vprint_unicode” and “vprint_nonunicode” is a problem, given that there is no guarantee that either of them actually supports Unicode, avoids Unicode, or has anything to do with Unicode at all.
There are many more instances where references to Unicode come with incorrect behavior; that is a problem.
I think that papers such as P1949R7, which seek to eliminate specific code points because they cause problems when you combine C++ with Unicode, are a problem.
There is a lot of effort to accommodate more Unicode, and it adds things that are specifically allowed or disallowed in the language when it doesn't have to.

> I'm seeing a lot of venting below, but rather little rationale, and also rather few constructive alternatives.

But there is a constructive alternative here. The constructive alternative, the thing I am saying you should do, just happens to be to not do something.
Let me make that clear by answering the following question:

> I'm not aware of any other similarly comprehensive catalog. Do you have an alternative in mind?
Yes, I have a better alternative in mind: don't have one.
As long as the encoding is mappable to a superset of ASCII, we are good. You just need to be able to identify syntactically significant characters like “\, &, ^, !”, etc., identify digits (0, 1, 2, …) and some basic text (line feed, space, A, B, C, …); the rest is a bonus that the language accepts as-is without caring what it is.
A user can use UTF-8 as a file encoding and encode code points above 0x7F, but as far as interpreting the language is concerned, that shouldn't matter: the language should have nothing to say about anything above 0x7F. The fact that identifiers are written in katakana is a detail relevant only to an IDE drawing text on the screen for a user.
But the file could have been in any other encoding that satisfies the requirements for the C++ base code point set, and those encodings shouldn't be penalized because certain code points were explicitly disallowed due to issues in UTF-8 when UTF-8 is not even used at all.
You support everyone by not explicitly supporting anything. That’s the feature.
You allow for Unicode by removing references to Unicode from the language, so you don't have to chase new Unicode standard updates or take care of increasingly weird exceptions.
You see: these are the rules, period. Does Unicode satisfy them? Great, you can use it. Which version does the language explicitly support? None. It doesn't matter.

Adding Unicode support to the language should start with conversion functions that allow you to convert between different encodings, or that provide checks or guarantees on runtime data, which a user is free to skip if they want to use something else.
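As a sketch of what such an opt-in conversion function could look like, here is a minimal checked UTF-8 to UTF-32 decoder. The name decode_utf8 and the bool-returning error policy are invented for illustration and are not a proposed standard API:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical, opt-in conversion utility of the kind argued for above:
// an explicit, checked UTF-8 -> UTF-32 decode that a user calls
// deliberately. Returns false on any ill-formed input instead of guessing.
bool decode_utf8(std::string_view in, std::u32string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (c < 0x80)              { len = 1; cp = c; }        // ASCII
        else if ((c >> 5) == 0x06) { len = 2; cp = c & 0x1F; } // 2-byte lead
        else if ((c >> 4) == 0x0E) { len = 3; cp = c & 0x0F; } // 3-byte lead
        else if ((c >> 3) == 0x1E) { len = 4; cp = c & 0x07; } // 4-byte lead
        else return false;                                     // invalid lead byte
        if (i + len > in.size()) return false;                 // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc >> 6) != 0x02) return false;               // bad continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;                     // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                                       // out of range / surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

A caller who wants a different encoding simply never calls it; nothing in the core language would depend on it.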


Received on 2024-04-20 14:30:46