sg16: Re: [SG16] [isocpp-lib-ext] Questions for LEWG for P2093R4: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Tue, 27 Apr 2021 06:56:36 -0700

Sure, as I wrote:

> there might be issues with other cases (as expected)

This is another data point in favor of fixing locales.

- Victor

On Mon, Apr 26, 2021 at 1:15 PM Tom Honermann <tom_at_[hidden]> wrote:

> I don't think that test really exercises the concern Hubert raised.
> Please try this one instead (I don't have a convenient environment to try
> this myself). This one should print only the AM/PM designator for each of
> the locales. You might also try the "zh_CN" and "zh_HK" locales.
>
> #include <chrono>
> #include <format>
> #include <iostream>
> #include <locale>
>
> int main() {
>   using namespace std::literals::chrono_literals;
>   std::locale::global(std::locale("ja_JP"));
>   std::cout << "ja_JP: " << std::format("{:%p}\n",
>       std::chrono::system_clock::now().time_since_epoch());
>   std::locale::global(std::locale("ja_JP.utf8"));
>   std::cout << "ja_JP.utf8: " << std::format("{:%p}\n",
>       std::chrono::system_clock::now().time_since_epoch());
> }
>
> I can, not surprisingly, produce mojibake on a Linux system when the
> locale is set to "zh_CN.gb18030" or "zh_HK.big5hkscs" (correct output is
> produced with "zh_CN.utf-8" and "zh_HK.utf-8"; I don't have any Japanese
> locales installed).
>
> Tom.
>
> On 4/26/21 11:52 AM, Victor Zverovich via Lib-Ext wrote:
>
> Hi Hubert,
>
> Thanks for an interesting example. I don't think it's fundamentally
> different from other cases of mixing encodings that we have already
> discussed, but I've tested it anyway:
>
> #include <chrono>
> #include <format>
> #include <iostream>
> #include <locale>
>
> int main() {
>   using namespace std::literals::chrono_literals;
>   std::locale::global(std::locale(".950"));
>   std::cout << std::format("{:%r}\n",
>       std::chrono::system_clock::now().time_since_epoch());
> }
>
> This produces the following output on Windows with 950 console code page:
>
> C:\test>test
> 15:46:09
>
> although there might be issues with other cases (as expected).
>
> Moreover, because we are looking at the case when the literal encoding is
> UTF-8, you'll currently get mojibake even in pretty basic cases:
>
> std::cout << std::format("时间 {:%r}\n",
> std::chrono::system_clock::now().time_since_epoch());
>
> Output:
>
> C:\test>test
> ?園 15:49:36
>
> Cheers,
> Victor
>
>
> On Fri, Apr 16, 2021 at 10:31 AM Hubert Tong via Lib-Ext <
> lib-ext_at_[hidden]> wrote:
>
>> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>>
>>> The following are questions/concerns that came up during SG16 review of
>>> P2093 <https://wg21.link/p2093> that are worthy of further discussion
>>> in SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
>>> determined either not to be SG16 concerns or were deemed issues for
>>> which we did not want to hold back forward progress. These sentiments were
>>> not unanimous.
>>>
>>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
>>> during our February 10th telecon. The poll was:
>>>
>>> Poll: Forward P2093R3 to LEWG.
>>> - Attendance: 9
>>>
>>> SF  F  N  A  SA
>>>  4  2  2  0   1
>>>
>>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
>>> available at:
>>>
>>> - December 9th, 2020 telecon
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>>> review of P2093R2 <https://wg21.link/p2093r2>.
>>> - February 10th, 2021 telecon
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>>> review of P2093R3 <https://wg21.link/p2093r3>.
>>>
>>> Questions raised include:
>>>
>>> 1. How should errors in transcoding be handled?
>>> The Unicode recommendation is to substitute a replacement character
>>> for invalid code unit sequences. P2093R4 <https://wg21.link/p2093r4>
>>> added wording to this effect.
>>> 2. Should this feature move forward without a parallel proposal to
>>> provide the underlying implementation-dependent features needed to implement
>>> std::print()?
>>> Specifically, should this feature be blocked on exposing interfaces
>>> to 1) determine if a stream is connected directly to a terminal/console,
>>> and 2) write directly to a terminal/console (potentially bypassing a
>>> stream) using native interfaces where applicable? These features would be
>>> necessary in order to implement a portable version of std::print().
>>> (I believe Victor is already working on a companion paper).
>>> 3. The choice to base behavior on the compile-time choice of
>>> execution character set results in locale settings being ignored at
>>> run-time. Is that ok?
>>>    1. This choice will lead to unexpected results if a program runs
>>>    in a non-UTF-8 locale and consumes non-Unicode input (e.g., from stdin) and
>>>    then attempts to echo it back.
>>>
>>>
>> Out of the meeting, we were asked to continue the discussion on-list (and
>> also in SG16).
>>
>> Regarding this point, the non-Unicode "input" can be the result of the
>> formatting facility.
>>
>> The conversation so far seems to indicate that the locales are not
>> constrained to use UTF-8 even in modes where the encoding used for string
>> literals is UTF-8.
>> That seems to indicate that something like:
>>
>> std::print("{:%r}\n",
>> std::chrono::system_clock::now().time_since_epoch());
>>
>> which only uses the C++ library facilities in an attempt to present a
>> localized string runs the risk of generating replacement characters.
>>
>> For example, if
>> std::locale::global(std::locale(""));
>>
>> was run when the environment had a non-UTF-8 locale.
>>
>> For example, "下午" in Big5 could end up being "�U��".
>>
>>
>>>    2. Additionally, it means that a program that uses only ASCII
>>>    characters in string literals will nevertheless behave differently at
>>>    run-time depending on the choice of execution character set (which
>>>    historically has only affected the encoding of string literals).
>>> 4. When the execution character set is not UTF-8, should
>>> conversion to Unicode be performed when writing directly to a Unicode
>>> enabled terminal/console?
>>>    1. If so, should conversions be based on the compile-time literal
>>>    encoding or the locale dependent run-time execution encoding?
>>>    2. If the latter, that creates an odd asymmetry with the behavior
>>>    when the execution character set is UTF-8. Is that ok?
>>> 5. What are the implications for future support of std::print("{}
>>> {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?
>>>    1. As proposed, std::print() only produces unambiguously encoded
>>>    output when the execution character set is UTF-8 and it is clear how these
>>>    cases should be handled in that case.
>>>    2. But how would the behavior be defined when the execution
>>>    character set is not UTF-8? Would the arguments be converted to the
>>>    execution character set? Or to the locale dependent encoding?
>>>    3. Note that these concerns are relevant for std::format() as
>>>    well.
>>>
>>> An additional issue that was not discussed in SG16 relates to Unicode
>>> normalization. As proposed, the output will match expectations if
>>> the UTF-8 text does not contain any uses of combining characters. However,
>>> if combining characters are present, either because the text is in NFD or
>>> because there is no precomposed character defined, then the combining
>>> characters may be rendered separately from their base character as a result
>>> of terminal/console interfaces mapping code points rather than grapheme
>>> clusters to columns. Should std::print() also perform NFC
>>> normalization so that characters with precomposed forms are displayed
>>> correctly? (These concerns were explored in P1868
>>> <https://wg21.link/p1868> when it was adopted for C++20; see that paper
>>> for example screenshots; in practice, this is only an issue with the
>>> Windows console).
>>>
>>> It would not be unreasonable for LEWG to send some of these questions
>>> back to SG16 for more analysis.
>>>
>>> Tom.
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>> _______________________________________________
>> Lib-Ext mailing list
>> Lib-Ext_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
>> Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18572.php
>>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18674.php
>
>
>

Received on 2021-04-27 08:56:50