Date: Thu, 9 May 2024 02:47:57 +0300
Tom Honermann wrote:
> On 5/8/24 2:59 PM, Peter Dimov wrote:
>> Tom Honermann wrote:
>>> This keeps neglecting the basic fact that there are implementations and
>>> ecosystems that cannot adopt what you are suggesting. Not now, not in the
>>> near term, probably never.
>>
>> Would you please give one concrete example of such an implementation
>> or an ecosystem, and how translating Unicode literals to the _ordinary_
>> literal encoding on stream insertion would be a problem there?
>
> Any EBCDIC based system like z/OS.
OK, let's go with that.
> C++ code can't distinguish between literals and non-literals (except for UDLs,
> but that is irrelevant here), but I don't think you intended to constrain the
> question to Unicode literals.
>
>
> UTF-8 solves problems with mojibake. It does not solve problems with
> translations. Let's go back to a variation of an example I gave earlier that uses a
> hypothetical message catalog similar to GNU gettext() to provide translations
> of strings in UTF-8 in char8_t.
>
>
> std::cout << u8msg("In the month of ") << std::chrono::August << "\n";
This code doesn't work today, right?
So we're talking about - hypothetically - making this code possible to write,
in the year 2027, on z/OS systems.
> Say the ordinary literal encoding is IBM-1047. Translation to the ordinary
> literal encoding will limit the output to characters representable in that
> encoding; any other characters would presumably be replaced with
> substitution characters. If the program is run in an IBM-1047 environment,
> there is no problem. Now run that program in an environment with a
> Japanese locale using code page 954 (euc-jp). The message catalog lookup
> would produce a UTF-8 string that probably uses characters not in IBM-1047.
> Conversion to code page 954 will likely preserve those characters while
> conversion to IBM-1047 definitely would not.
Correct.
But this means that the program can't output any literals to std::cout, not
even spaces or punctuation. In fact, your code above won't work, if I'm not
mistaken, because "\n" is 0x25 0x00 in EBCDIC and 0x25 is '%' in ASCII/EUC-JP.
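The byte-level mismatch can be sketched as follows. This is only an illustration: it takes the 0x25 value above at face value, assumes the sketch itself is compiled with an ASCII-based literal encoding, and uses hypothetical names (ebcdic_newline, as_seen_by_ascii_reader, demo_mismatch) that are not part of any library.

```cpp
#include <cassert>
#include <string>

// Byte that the paragraph above gives for "\n" under an EBCDIC literal
// encoding. Taken from the message as-is; treat it as an assumption.
constexpr unsigned char ebcdic_newline = 0x25;

// Hypothetical helper: how a consumer that assumes an ASCII-compatible
// code page (ASCII, EUC-JP) reads a single raw byte. Assumes this sketch
// is compiled with an ASCII-based ordinary literal encoding.
inline std::string as_seen_by_ascii_reader(unsigned char byte) {
    return std::string(1, static_cast<char>(byte));
}

// The byte meant as a newline shows up as a percent sign, while the
// byte an ASCII reader expects for a newline is 0x0A.
inline void demo_mismatch() {
    assert(as_seen_by_ascii_reader(ebcdic_newline) == "%");
    assert(as_seen_by_ascii_reader(0x0A) == "\n");
}
```

Nothing is transcoded here on purpose: the point is that the literal's bytes reach the stream unchanged, so the reader's code page decides what they mean.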

So what's the bottom line?

You want to make it possible for newly written C++ code on z/OS to
output char8_t* to std::cout, but not ordinary literals, including
spaces and punctuation, e.g. "\n". The price you want to pay for that is the
addition of new locale machinery that has no corresponding POSIX locale
category.

And this comes at everyone else's expense, namely the part of the C++
community that has no use for it.

Forgive me, but I don't see the justification here.

(Note that environments where the literal encoding and the runtime
code page match, which is the majority of code page use, don't fit
the above scenario. They can safely output a mix of literals, messages,
and months to std::cout.)

Ah, you say, I'll just use u8 literals everywhere:

    std::cout << u8msg("In the month of ") << std::chrono::August << u8"\n";

et voilà.

OK, but if you are reduced to that, why even insist on EBCDIC for your
literal encoding? You can't use it anywhere, and if you do by mistake,
as in the above, everything happily compiles and does the wrong thing.

TL;DR I don't think that trying to support (hypothetical future) environments
where

    std::cout << u8"\n";

and

    std::cout << "\n";

do different things is a productive use of our time, or of net benefit to the
C++ community.
Received on 2024-05-08 23:48:01