Date: Wed, 8 May 2024 21:29:44 -0400
On 5/8/24 7:47 PM, Peter Dimov wrote:
> Tom Honermann wrote:
>> On 5/8/24 2:59 PM, Peter Dimov wrote:
>>> Tom Honermann wrote:
>>>> This keeps neglecting the basic fact that there are implementations and
>>>> ecosystems that cannot adopt what you are suggesting. Not now, not in the
>>>> near term, probably never.
>>>
>>> Would you please give one concrete example of such an implementation
>>> or an ecosystem, and how translating Unicode literals to the _ordinary_
>>> literal encoding on stream insertion would be a problem there?
>>
>> Any EBCDIC based system like z/OS.
> OK, let's go with that.
>
>> C++ code can't distinguish between literals and non-literals (except for UDLs,
>> but that is irrelevant here), but I don't think you intended to constrain the
>> question to Unicode literals.
>>
>>
>> UTF-8 solves problems with mojibake. It does not solve problems with
>> translations. Let's go back to a variation of an example I gave earlier that uses a
>> hypothetical message catalog similar to GNU gettext() to provide translations
>> of strings in UTF-8 in char8_t.
>>
>>
>> std::cout << u8msg("In the month of ") << std::chrono::August << "\n";
> This code doesn't work today, right?
>
> So we're talking about - hypothetically - making this code possible to write,
> in the year 2027, on z/OS systems.
Are you under the impression that no one is developing C++ code for
these systems today? IBM provides two C++ compilers for z/OS
<https://www.ibm.com/products/xl-cpp-compiler-zos> and they are busy
contributing support to LLVM/Clang. Dignus provides an LLVM based
compiler for z/OS <http://www.dignus.com/press_releases/200728.html>.
These systems are not ubiquitous, but they are not a historical artifact
of a bygone era either.
>
>> Say the ordinary literal encoding is IBM-1047. Translation to the ordinary
>> literal encoding will limit the output to characters representable in that
>> encoding; any other characters would presumably be replaced with
>> substitution characters. If the program is run in an IBM-1047 environment,
>> there is no problem. Now run that program in an environment with a
>> Japanese locale using code page 954 (euc-jp). The message catalog lookup
>> would produce a UTF-8 string that probably uses characters not in IBM-1047.
>> Conversion to code page 954 will likely preserve those characters while
>> conversion to IBM-1047 definitely would not.
Sigh, I was in a rush and pasted from the wrong reference. I didn't mean
to refer to an ASCII-based encoding; that would indeed lead to madness
like you described below. I had a code page like IBM-1027 in mind.
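
To make the scenario quoted above concrete, here is a minimal sketch of
why the choice of conversion target matters. The two repertoire
predicates are hypothetical, coarse stand-ins that I made up for
illustration (a Latin-only repertoire for the ordinary literal encoding,
a Japanese-capable one for the runtime code page); they are not real
code page tables, and count_substitutions is not an existing API.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical, coarse approximations of two repertoires (NOT real tables).
// Latin-only repertoire, standing in for the ordinary literal encoding.
constexpr bool in_latin_repertoire(char32_t c) { return c <= 0xFF; }

// Japanese-capable repertoire, standing in for the runtime code page:
// ASCII range, Hiragana/Katakana, and the main CJK ideograph block.
constexpr bool in_japanese_repertoire(char32_t c) {
    return c <= 0x7F || (c >= 0x3040 && c <= 0x30FF) ||
           (c >= 0x4E00 && c <= 0x9FFF);
}

// Counts the code points that a conversion to the target repertoire
// would have to replace with a substitution character.
template <class Pred>
constexpr std::size_t count_substitutions(std::u32string_view text,
                                          Pred representable) {
    std::size_t lost = 0;
    for (char32_t c : text) {
        if (!representable(c)) ++lost;
    }
    return lost;
}

// A translated month name as a message catalog might return it
// (digit 8 followed by U+6708).
constexpr std::u32string_view august_ja = U"8\u6708";

// Converting to the literal encoding loses the ideograph ...
static_assert(count_substitutions(august_ja, in_latin_repertoire) == 1);
// ... while converting to the runtime code page preserves it.
static_assert(count_substitutions(august_ja, in_japanese_repertoire) == 0);
```

The point of the sketch is only that the loss is decided by the target
of the conversion, not by the UTF-8 source string.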
As mentioned previously, a code page based programming model relies on
the concept of an invariant character set. In the Windows world and on most
POSIX systems, the invariant character set is ASCII. The EBCDIC world is
more complicated, relies on documentation, and has exceptions. See
https://www.ibm.com/docs/en/i/7.5?topic=sets-invariant-character-set-its-exceptions.
If you contrast the EBCDIC invariant character set documented there
with the C++ basic character set
<http://eel.is/c++draft/lex.charset#def:character_set,basic>, you'll
find that the characters that are in the latter but not in the
former are exactly those for which we specify alternate tokens in
[lex.digraph] <http://eel.is/c++draft/lex.digraph> (with the exception
of the recently added '$', '@', and '`' characters). Of course, the
alternate tokens aren't relevant for this discussion since they are only
meaningful from a source file encoding perspective; I just mention it as
interesting history.
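
As a rough illustration of what relying on an invariant character set
means in practice, the sketch below checks whether a string uses only
invariant characters. The character list is my assumption of the
graphic characters in the EBCDIC invariant set, and the names
assumed_invariant and uses_only_invariant are made up for this example;
please check the IBM page linked above rather than trusting this list.

```cpp
#include <string_view>

// ASSUMPTION: approximate list of the graphic characters in the EBCDIC
// invariant character set. Verify against the IBM documentation.
constexpr std::string_view assumed_invariant =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    " \"%&'()*+,-./:;<=>?_";

// Returns true if every character of text is in the assumed invariant
// set, i.e. the text should encode identically across the code pages
// that share that set.
constexpr bool uses_only_invariant(std::string_view text) {
    for (char c : text) {
        if (assumed_invariant.find(c) == std::string_view::npos) {
            return false;
        }
    }
    return true;
}

// Plain words and spaces stay within the invariant set.
static_assert(uses_only_invariant("In the month of "));
// Square brackets are variant characters in EBCDIC code pages.
static_assert(!uses_only_invariant("a[0]"));
```

Text that passes such a check can be mixed with locale-encoded output
under the code page model; text that fails it depends on the literal
encoding and the runtime code page agreeing.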
As for any of this coming at the expense of programmers that work on
other platforms, that is not my intent at all. I don't think the
behavior I'm arguing for would change the behavior that you (and I) want
for the environments that you are concerned about. If you believe that
not to be the case, then we need to clarify where this would lead to
unwanted behavior (that doesn't already happen). Perhaps it would be
helpful to detail the relevant scenarios; I can probably do that tomorrow.
I'm not going to respond to anything else below since it was based on a
miscommunication on my part.
Tom.
> Correct.
>
> But this means that the program can't output any literals to std::cout, not
> even spaces or punctuation. In fact, your code above won't work, if I'm not
> mistaken, because "\n" is 0x25 0x00 in EBCDIC and 0x25 is '%' in ASCII/EUC-JP.
>
> So what's the bottom line?
>
> You want to make it possible for newly written C++ code on z/OS to
> be able to output char8_t* to std::cout, but not ordinary literals, including
> spaces and punctuation, e.g. "\n". The price you want to pay for that is the
> addition of new locale machinery that has no corresponding POSIX locale
> category.
>
> And this comes at everyone else's expense, the part of the C++
> community who have no use for it.
>
> Forgive me, but I don't see the justification here.
>
> (Note that environments where the literal encoding and the runtime
> code page match, which is the majority of code page use, don't fit
> the above scenario. They can safely output a mix of literals, messages,
> and months to std::cout.)
>
> Ah, you say, I'll just use u8 literals everywhere:
>
> std::cout << u8msg("In the month of ") << std::chrono::August << u8"\n";
>
> et voila.
>
> OK, but if you are reduced to that, why even insist on EBCDIC for your
> literal encoding? You can't use it anywhere, and if you do by mistake,
> as in the above, everything happily compiles and does the wrong thing.
>
> TL;DR I don't think that trying to support (hypothetical future) environments
> where
>
> std::cout << u8"\n";
>
> and
>
> std::cout << "\n";
>
> do different things, is productive use of our time, or of net benefit to the
> C++ community.
>
>
Received on 2024-05-09 01:29:51
