SG16

Subject: Re: [isocpp-lib-ext] Questions for LEWG for P2093R4: Formatted output
From: Tom Honermann (tom_at_[hidden])
Date: 2021-04-27 09:53:35


On 4/27/21 9:56 AM, Victor Zverovich wrote:
> Sure, as I wrote:
>
> >  there might be issues with other cases (as expected)
>
> This is another data point in favor of fixing locales.

No disagreement there, but that is not a trivial undertaking. The
question is, what do we do in the meantime?

I think the LWG issue that Corentin just submitted ("Time formatters
should not be locale sensitive by default"; so far sent only to the LWG
chair and the SG16 mailing list) would be a good thing to resolve as a
C++20 DR. We could then add locale support via the 'L' specifier when
we're ready to do it right.

Tom.

>
> - Victor
>
>
> On Mon, Apr 26, 2021 at 1:15 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> I don't think that test really exercises the concern Hubert
> raised.  Please try this one instead (I don't have a convenient
> environment to try this myself).  This one should print only the
> AM/PM designator for each of the locales.  You might also try the
> "zh_CN" and "zh_HK" locales.
>
> #include <chrono>
> #include <format>
> #include <iostream>
> #include <locale>
>
> int main() {
>   // Time elapsed since the epoch; %p formats only the AM/PM designator.
>   const auto now = std::chrono::system_clock::now().time_since_epoch();
>   std::locale::global(std::locale("ja_JP"));
>   std::cout << "ja_JP: " << std::format("{:%p}\n", now);
>   std::locale::global(std::locale("ja_JP.utf8"));
>   std::cout << "ja_JP.utf8: " << std::format("{:%p}\n", now);
> }
>
> I can, not surprisingly, produce mojibake on a Linux system when
> the locale is set to "zh_CN.gb18030" or "zh_HK.big5hkscs" (correct
> output is produced with "zh_CN.utf-8" and "zh_HK.utf-8"; I don't
> have any Japanese locales installed).
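For anyone without a <format> implementation handy, here is a rough
iostreams analogue of the test above. It is only a sketch: it goes
through std::put_time rather than std::format, so it exercises the
locale's time_put facet and not the chrono formatter itself. The helper
names am_pm_designator and hex_bytes are made up for illustration.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats only the AM/PM designator ("%p") for `hour` (0-23) using `loc`.
// The result is in whatever encoding the locale uses.
std::string am_pm_designator(const std::locale& loc, int hour) {
  std::tm tm{};          // all other fields zero; %p only reads tm_hour
  tm.tm_hour = hour;
  std::ostringstream os;
  os.imbue(loc);
  os << std::put_time(&tm, "%p");
  return os.str();
}

// Renders a byte string as space-separated hex, so that encoding
// differences stay visible even when the terminal shows mojibake.
std::string hex_bytes(const std::string& bytes) {
  std::ostringstream os;
  os << std::hex << std::uppercase << std::setfill('0');
  bool first = true;
  for (unsigned char b : bytes) {
    if (!first) os << ' ';
    first = false;
    os << std::setw(2) << static_cast<unsigned>(b);
  }
  return os.str();
}
```

Calling hex_bytes(am_pm_designator(std::locale("zh_CN.gb18030"), 15))
and the same with "zh_CN.utf-8" should show two different byte
sequences for the same designator, if those locales are installed. The
std::locale constructor throws std::runtime_error when a named locale
is not available, so a real test would want to catch that.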
>
> Tom.
>
> On 4/26/21 11:52 AM, Victor Zverovich via Lib-Ext wrote:
>> Hi Hubert,
>>
>> Thanks for an interesting example. I don't think it's
>> fundamentally different from other cases of mixing encodings that
>> we have already discussed, but I've tested it anyway:
>>
>>   #include <chrono>
>>   #include <format>
>>   #include <iostream>
>>   #include <locale>
>>
>>   int main() {
>>     // ".950" selects Windows code page 950 (Big5).
>>     std::locale::global(std::locale(".950"));
>>     std::cout << std::format(
>>         "{:%r}\n", std::chrono::system_clock::now().time_since_epoch());
>>   }
>>
>> This produces the following output on Windows with 950 console
>> code page:
>>
>>   C:\test>test
>>   15:46:09
>>
>> although there might be issues with other cases (as expected).
>>
>> Moreover, because we are looking at the case when the literal
>> encoding is UTF-8, you'll currently get mojibake even in pretty
>> basic cases:
>>
>>   std::cout << std::format(
>>       "时间 {:%r}\n", std::chrono::system_clock::now().time_since_epoch());
>>
>> Output:
>>
>>   C:\test>test
>>   ?園 15:49:36
>>
>> Cheers,
>> Victor
>>
>>
>> On Fri, Apr 16, 2021 at 10:31 AM Hubert Tong via Lib-Ext
>> <lib-ext_at_[hidden] <mailto:lib-ext_at_[hidden]>> wrote:
>>
>> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>
>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>
>> The following are questions/concerns that came up during
>> SG16 review of P2093 <https://wg21.link/p2093> that are
>> worthy of further discussion in SG16 and/or LEWG.  Most
>> of these issues were discussed in SG16 and were
>> determined either not to be SG16 concerns or to be
>> issues for which we did not want to hold back
>> forward progress.  These sentiments were not unanimous.
>>
>> The SG16 poll to forward P2093R3
>> <https://wg21.link/p2093r3> was taken during our February
>> 10th telecon. The poll was:
>>
>> Poll: Forward P2093R3 to LEWG.
>> - Attendance: 9
>>
>>   SF   F   N   A  SA
>>    4   2   2   0   1
>>
>> Minutes for prior SG16 reviews of P2093
>> <https://wg21.link/p2093> are available at:
>>
>> * December 9th, 2020 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>> review of P2093R2 <https://wg21.link/p2093r2>.
>> * February 10th, 2021 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>> review of P2093R3 <https://wg21.link/p2093r3>.
>>
>> Questions raised include:
>>
>> 1. How should errors in transcoding be handled?
>> The Unicode recommendation is to substitute a
>> replacement character for invalid code unit
>> sequences. P2093R4 <https://wg21.link/p2093r4> added
>> wording to this effect.
>> 2. Should this feature move forward without a parallel
>> proposal to provide the underlying implementation-dependent
>> features needed to implement std::print()?
>> Specifically, should this feature be blocked on
>> exposing interfaces to 1) determine if a stream is
>> connected directly to a terminal/console, and 2)
>> write directly to a terminal/console (potentially
>> bypassing a stream) using native interfaces where
>> applicable?  These features would be necessary in
>> order to implement a portable version of
>> std::print(). (I believe Victor is already working on
>> a companion paper).
>> 3. The choice to base behavior on the compile-time
>> choice of execution character set results in locale
>> settings being ignored at run-time.  Is that ok?
>> 1. This choice will lead to unexpected results if a
>> program runs in a non-UTF-8 locale and consumes
>> non-Unicode input (e.g., from stdin) and then
>> attempts to echo it back.
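On the first of the two facilities named above (detecting whether a
stream is connected directly to a terminal/console), the following is a
minimal sketch of how that check can be done today with platform APIs.
The function name is_terminal is hypothetical and this is not the
interface of any companion paper; it only uses isatty()/fileno() on
POSIX and _isatty()/_fileno()/GetConsoleMode() on Windows.

```cpp
#include <cstdio>
#ifdef _WIN32
#  include <io.h>
#  include <windows.h>
#else
#  include <stdio.h>   // fileno (POSIX)
#  include <unistd.h>  // isatty (POSIX)
#endif

// Returns true if `stream` appears to be attached to an interactive
// terminal/console rather than to a file or pipe.
bool is_terminal(std::FILE* stream) {
  if (stream == nullptr) return false;
#ifdef _WIN32
  const int fd = _fileno(stream);
  if (fd < 0 || !_isatty(fd)) return false;
  // _isatty() is also true for other character devices (such as NUL),
  // so additionally ask the console API whether this is a console.
  DWORD mode = 0;
  const auto handle = reinterpret_cast<HANDLE>(_get_osfhandle(fd));
  return GetConsoleMode(handle, &mode) != 0;
#else
  const int fd = fileno(stream);
  return fd >= 0 && isatty(fd) != 0;
#endif
}
```

A std::print()-like function could use a check of this kind to decide
between writing bytes to the stream and going through a native console
interface. Getting from a std::ostream to the underlying file or handle
is the part that has no portable answer, which is presumably why the
question about a parallel proposal came up.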
>>
>>
>> Out of the meeting, we were asked to continue the discussion
>> on-list (and also in SG16).
>>
>> Regarding this point, the non-Unicode "input" can be the
>> result of the formatting facility.
>>
>> The conversation so far seems to indicate that the locales
>> are not constrained to use UTF-8 even in modes where the
>> encoding used for string literals is UTF-8.
>> That seems to indicate that something like:
>>
>> std::print("{:%r}\n",
>> std::chrono::system_clock::now().time_since_epoch());
>>
>> which only uses the C++ library facilities in an attempt to
>> present a localized string runs the risk of generating
>> replacement characters.
>>
>> For example, if
>> std::locale::global(std::locale(""));
>>
>> was run when the environment had a non-UTF-8 locale.
>>
>> For example, "下午" in Big5 could end up being "�U��".
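To make the Big5 example concrete, here is a minimal sketch of
replacement-character substitution when non-UTF-8 bytes are treated as
UTF-8. It assumes that Big5 encodes "下午" as the bytes A4 55 A4 C8
(worth double-checking against a Big5 table). sanitize_utf8 is a
hypothetical helper, not something from P2093 or any library; it
replaces each maximal ill-formed subsequence with one U+FFFD, which is
my understanding of the practice the Unicode Standard recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Copies `in` to the result, replacing every ill-formed UTF-8
// subsequence with U+FFFD (encoded in UTF-8 as EF BF BD).
std::string sanitize_utf8(std::string_view in) {
  static constexpr char kReplacement[] = "\xEF\xBF\xBD";
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const auto b0 = static_cast<unsigned char>(in[i]);
    if (b0 < 0x80) { out += in[i++]; continue; }  // ASCII

    // Sequence length and the permitted range of the second byte,
    // following the well-formed UTF-8 byte sequence table.
    std::size_t len = 0;
    unsigned lo = 0x80, hi = 0xBF;
    if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // no overlong forms
    else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // no surrogates
    else if (b0 >= 0xE1 && b0 <= 0xEF) len = 3;
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // no overlong forms
    else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // at most U+10FFFF
    else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;

    if (len == 0) {  // stray continuation byte or invalid lead byte
      out += kReplacement;
      ++i;
      continue;
    }
    std::size_t n = 1;  // bytes of this sequence accepted so far
    while (n < len && i + n < in.size()) {
      const auto b = static_cast<unsigned char>(in[i + n]);
      const unsigned min = (n == 1) ? lo : 0x80;
      const unsigned max = (n == 1) ? hi : 0xBF;
      if (b < min || b > max) break;
      ++n;
    }
    if (n == len) out.append(in.substr(i, len));
    else out += kReplacement;  // truncated or interrupted sequence
    i += n;
  }
  assert(i == in.size());
  return out;
}
```

With the assumed Big5 bytes, sanitize_utf8("\xA4\x55\xA4\xC8") yields
U+FFFD, 'U', U+FFFD, U+FFFD: the lone A4 bytes are stray continuation
bytes, 55 is ASCII 'U', and C8 is a lead byte with nothing after it.
That matches the "�U��" shown above.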
>>
>> 2. Additionally, it means that a program that uses
>> only ASCII characters in string literals will
>> nevertheless behave differently at run-time
>> depending on the choice of execution character
>> set (which historically has only affected the
>> encoding of string literals).
>> 4. When the execution character set is not UTF-8, should
>> conversion to Unicode be performed when writing
>> directly to a Unicode enabled terminal/console?
>> 1. If so, should conversions be based on the
>> compile-time literal encoding or the locale
>> dependent run-time execution encoding?
>> 2. If the latter, that creates an odd asymmetry with
>> the behavior when the execution character set is
>> UTF-8.  Is that ok?
>> 5. What are the implications for future support of
>> std::print("{} {} {} {}", L"Wide text", u8"UTF-8
>> text", u"UTF-16 text", U"UTF-32 text")?
>> 1. As proposed, std::print() only produces
>> unambiguously encoded output when the execution
>> character set is UTF-8 and it is clear how these
>> cases should be handled in that case.
>> 2. But how would the behavior be defined when the
>> execution character set is not UTF-8?  Would the
>> arguments be converted to the execution character
>> set?  Or to the locale dependent encoding?
>> 3. Note that these concerns are relevant for
>> std::format() as well.
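For the UTF-8 case of the mixed-argument std::print() call above, the
handling is presumably to transcode each argument to UTF-8 before
output. P2093 does not specify this, so the following is only a sketch
of that assumption for the U"UTF-32 text" argument; to_utf8 is a
hypothetical helper name.

```cpp
#include <string>
#include <string_view>

// Encodes UTF-32 text as UTF-8. Values that are not Unicode scalar
// values (surrogates, or anything above U+10FFFF) become U+FFFD.
std::string to_utf8(std::u32string_view text) {
  std::string out;
  for (char32_t cp : text) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx then three 10xxxxxx
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

The open question is the other case: when the execution character set
is not UTF-8 there is no equally obvious target encoding, and a
conversion like this one would have to be replaced by something that
depends on either the literal encoding or the run-time locale.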
>>
>> An additional issue that was not discussed in SG16
>> relates to Unicode normalization. As proposed, the
>> output will match expectations if the UTF-8 text
>> does not contain any uses of combining characters.
>> However, if combining characters are present, either
>> because the text is in NFD or because there is no
>> precomposed character defined, then the combining
>> characters may be rendered separately from their base
>> character as a result of terminal/console interfaces
>> mapping code points rather than grapheme clusters to
>> columns.  Should std::print() also perform NFC
>> normalization so that characters with precomposed forms
>> are displayed correctly?  (These concerns were explored
>> in P1868 <https://wg21.link/p1868> when it was adopted
>> for C++20; see that paper for example screenshots; in
>> practice, this is only an issue with the Windows console).
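A small illustration of why the normalization form matters to an
interface that maps code points to columns: the same user-perceived
character can be one code point or two. This is only a sketch;
count_code_points is a hypothetical helper that assumes well-formed
UTF-8 input, and it performs no normalization itself.

```cpp
#include <cstddef>
#include <string_view>

// Counts code points in well-formed UTF-8 by counting every byte that
// is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(std::string_view utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For "é" in NFC (U+00E9, bytes C3 A9) this returns 1, while for the NFD
form (U+0065 followed by U+0301, bytes 65 CC 81) it returns 2, even
though both are a single grapheme cluster. A console that assigns one
column per code point would treat the second form as two cells.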
>>
>> It would not be unreasonable for LEWG to send some of
>> these questions back to SG16 for more analysis.
>>
>> Tom.
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>> _______________________________________________
>> Lib-Ext mailing list
>> Lib-Ext_at_[hidden] <mailto:Lib-Ext_at_[hidden]>
>> Subscription:
>> https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
>> Link to this post:
>> http://lists.isocpp.org/lib-ext/2021/04/18572.php
>>
>>
>> _______________________________________________
>> Lib-Ext mailing list
>> Lib-Ext_at_[hidden] <mailto:Lib-Ext_at_[hidden]>
>> Subscription:https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
>> Link to this post:http://lists.isocpp.org/lib-ext/2021/04/18674.php
>
>


