Date: Sat, 20 Apr 2024 20:37:12 +0200
On 20/04/2024 16.30, Tiago Freire wrote:
>
>> Well, that's not the status quo of the C++ standard.
>> See table [tab:lex.string.literal] in [lex.string].
>> If you feel that C++ is taking the wrong direction in this matter (or any of the other matters you voice below), please write a paper with rationale, directed at SG16.
>
> Sure, I can write a paper. And yes, I think the direction is kind of a problem.
> I want to use char8_t/char16_t/char32_t for other encodings that are not Unicode.
> The fact that it explicitly lists utf-8/utf-16/utf-32 is a problem, specifically considering that you manipulate these things at the byte/word level without having to conform to any specification (including unicode).
> Having things like filesystem::path specifying explicit encodings for utf8/utf16 and that there must be a conversion that can be different between char/wchar_t and char8_t/char16_t types, despite the fact that no filesystem is Unicode
There seem to be case-insensitive filesystems at least on Linux
(for better or worse), and those certainly need to be encoding-aware.
https://lwn.net/Articles/754508/
Unicode / UTF-8 seems to be the natural choice for that.
Also, my understanding is that Windows has case-insensitive filesystems
by default.
> , plus it doesn't even specify what algorithm is used to perform the conversion, is a problem.
Yes. There are efforts underway to offer transcoding functions between
UTF-8 and other encodings. If you want to help, please have a look at
the pending papers and talk to the respective authors.
> There are many more instances of references to Unicode, with incorrect behavior, is a problem.
Care to list them, with details on the "incorrect behavior" part?
Generic comments don't help much in fixing particular issues.
> I think that papers such as p1949r7, that seek to eliminate specific code points because they cause problems when you combine C++ with Unicode, is a problem.
Regardless of Unicode, you seem to agree that people should be
able to express identifiers using non-English scripts. In C++,
identifiers are separated by whitespace from adjacent
identifiers. It would be very confusing to me if two identifiers
appear to be separated by visual whitespace, yet the whole
sequence is interpreted as a single identifier.
Furthermore, there are also ideas to use mathematical operator
symbols as C++ operators. If those can become part of regular
identifiers, that avenue of future evolution is blocked.
> There’s a lot of effort to accommodate more Unicode, this adds specific things that are allowed or not allowed into the language, when it doesn’t have to.
I believe there are some aspects of non-English scripts that
need to be restricted; as far as I have understood, WG21 did
not feel comfortable with having a long-term maintenance
burden in that area and thus deferred the rule-making to the
Unicode standard. Other rule sets are plausible, but I don't
agree we can live with no rules at all for identifiers
expressed in non-English scripts.
>> I'm seeing a lot of venting below, but rather little rationale, and also rather few constructive alternatives.
>
> But there is a constructive alternative here. The constructive alternative, the thing that I am saying that you should do, just happens to be to “not to do something”.
> Let me make that clear by answering the following question:
>
>> I'm not aware of any other similarly comprehensive catalog. Do you have an alternative in mind?
> Yes, I have a better alternative in mind, don’t have one.
So, how would you express the semantics of universal-character-names,
then?
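
To make the question concrete, here is a minimal sketch of what a
universal-character-name means today: it names a code point, and the
literal prefix alone decides the code units. The array names are
made up for illustration.

```cpp
#include <cassert>

// \U0001F600 names the code point U+1F600, regardless of how the
// source file itself is encoded. (Array names are illustrative only.)
constexpr char32_t smiley_utf32[] = U"\U0001F600";
constexpr char16_t smiley_utf16[] = u"\U0001F600";

// UTF-32: one code unit plus the terminating null.
static_assert(sizeof(smiley_utf32) / sizeof(char32_t) == 2, "one UTF-32 code unit");
static_assert(smiley_utf32[0] == 0x1F600, "code point value");

// UTF-16: a surrogate pair plus the terminating null.
static_assert(sizeof(smiley_utf16) / sizeof(char16_t) == 3, "two UTF-16 code units");
static_assert(smiley_utf16[0] == 0xD83D && smiley_utf16[1] == 0xDE00, "surrogate pair");
```

Without some named character repertoire, I don't see what value the
escape sequence would denote.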
> But it could have been any other encoding that satisfies all the requirements for the C++ base codepoint set, and those encodings shouldn’t be penalized because you have explicitly disallowed certain code points due to issues in utf-8 when utf-8 is not even used at all.
UTF-8 is not the issue here; UTF-8 is just an encoding. The Unicode
rules about forming valid identifiers are on the code point level.
Can you give specific examples where you believe a plausible
identifier is prevented by the Unicode identifier rules?
Note that we already have rules in C++ using just the English
language where not every valid English word is an identifier.
For example, "one-way" is a valid English compound adjective,
but not a valid spelling for an identifier.
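
A small sketch of the "one-way" case, with a made-up function name:
the spelling is accepted by the compiler, but as three tokens rather
than one identifier.

```cpp
#include <cassert>

// "one-way" lexes as the identifier one, the operator '-', and the
// identifier way. (Function name is illustrative only.)
inline int one_minus_way() {
    int one = 5;
    int way = 2;
    return one-way;  // binary minus, not a single identifier
}
```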
> You support everyone by not explicitly supporting anything. That’s the feature.
> You allow for unicode by removing references to unicode from the language, so you don't have to chase after new unicode standard updates or have to take care of increasingly weird exceptions.
We would actually love to do that, but we have features in the
language proper (not just concerning the encoding of source files)
where we need to refer to a particular character repertoire.
> Adding Unicode support to the language should start by providing conversion functions that allow you to convert between different encodings, or that provide some checks or guarantees on runtime data that a user is free to not use if they want to use something else.
Agreed on that part. So, do you have specific concerns about the
proposals for providing exactly those conversion functions,
or at least a subset thereof?
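
For reference, the kind of function under discussion could look
roughly like the sketch below. This is purely an illustration under
my own assumptions; the name append_utf8 and its signature are
hypothetical and are not taken from any of the pending papers.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point
// to out. Returns false and leaves out untouched if cp is not a
// Unicode scalar value (a surrogate, or beyond U+10FFFF).
inline bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

A user who wants a different encoding is of course free not to call
anything like this.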
Jens
Received on 2024-04-20 18:37:33