sg16: Re: [SG16] [isocpp-lib-ext] Questions for LEWG for P2093R4: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Mon, 26 Apr 2021 08:52:26 -0700

Hi Hubert,

Thanks for an interesting example. I don't think it's fundamentally
different from other cases of mixing encodings that we have already
discussed, but I've tested it anyway:

  #include <chrono>
  #include <format>
  #include <iostream>
  #include <locale>

  int main() {
    using namespace std::literals::chrono_literals;
    std::locale::global(std::locale(".950"));
    std::cout << std::format("{:%r}\n",
                             std::chrono::system_clock::now().time_since_epoch());
  }

This produces the following output on Windows with 950 console code page:

  C:\test>test
  15:46:09

although there might be issues with other cases (as expected).

Moreover, because we are looking at the case when the literal encoding is
UTF-8, you'll currently get mojibake even in pretty basic cases:

  std::cout << std::format("时间 {:%r}\n",
                           std::chrono::system_clock::now().time_since_epoch());

Output:

  C:\test>test
  ?園 15:49:36

Cheers,
Victor

On Fri, Apr 16, 2021 at 10:31 AM Hubert Tong via Lib-Ext <
lib-ext_at_[hidden]> wrote:

> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>
>> The following are questions/concerns that came up during SG16 review of
>> P2093 <https://wg21.link/p2093> that are worthy of further discussion in
>> SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
>> determined either not to be SG16 concerns or were deemed issues for
>> which we did not want to hold back forward progress. These sentiments were
>> not unanimous.
>>
>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
>> during our February 10th telecon. The poll was:
>>
>> Poll: Forward P2093R3 to LEWG.
>> - Attendance: 9
>>
>>   SF  F  N  A  SA
>>    4  2  2  0   1
>>
>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
>> available at:
>>
>> - December 9th, 2020 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>> review of P2093R2 <https://wg21.link/p2093r2>.
>> - February 10th, 2021 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>> review of P2093R3 <https://wg21.link/p2093r3>.
>>
>> Questions raised include:
>>
>> 1. How should errors in transcoding be handled?
>>    The Unicode recommendation is to substitute a replacement character
>>    for invalid code unit sequences. P2093R4 <https://wg21.link/p2093r4>
>>    added wording to this effect.
>> 2. Should this feature move forward without a parallel proposal to
>>    provide the underlying implementation-dependent features needed to
>>    implement std::print()?
>>    Specifically, should this feature be blocked on exposing interfaces
>>    to 1) determine if a stream is connected directly to a terminal/console,
>>    and 2) write directly to a terminal/console (potentially bypassing a
>>    stream) using native interfaces where applicable? These features would be
>>    necessary in order to implement a portable version of std::print().
>>    (I believe Victor is already working on a companion paper).
>> 3. The choice to base behavior on the compile-time choice of
>>    execution character set results in locale settings being ignored at
>>    run-time. Is that ok?
>>    1. This choice will lead to unexpected results if a program runs
>>       in a non-UTF-8 locale and consumes non-Unicode input (e.g., from
>>       stdin) and then attempts to echo it back.
>>
>>
> Out of the meeting, we were asked to continue the discussion on-list (and
> also in SG16).
>
> Regarding this point, the non-Unicode "input" can be the result of the
> formatting facility.
>
> The conversation so far seems to indicate that the locales are not
> constrained to use UTF-8 even in modes where the encoding used for string
> literals is UTF-8.
> That seems to indicate that something like:
>
> std::print("{:%r}\n", std::chrono::system_clock::now().time_since_epoch());
>
> which only uses the C++ library facilities in an attempt to present a
> localized string runs the risk of generating replacement characters.
>
> For example, if
> std::locale::global(std::locale(""));
>
> was run when the environment had a non-UTF-8 locale.
>
> For example, "下午" in Big5 could end up being "�U��".
>
>
>>    2. Additionally, it means that a program that uses only ASCII
>>       characters in string literals will nevertheless behave differently at
>>       run-time depending on the choice of execution character set (which
>>       historically has only affected the encoding of string literals).
>> 4. When the execution character set is not UTF-8, should
>>    conversion to Unicode be performed when writing directly to a Unicode
>>    enabled terminal/console?
>>    1. If so, should conversions be based on the compile-time literal
>>       encoding or the locale dependent run-time execution encoding?
>>    2. If the latter, that creates an odd asymmetry with the behavior
>>       when the execution character set is UTF-8. Is that ok?
>> 5. What are the implications for future support of
>>    std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?
>>    1. As proposed, std::print() only produces unambiguously encoded
>>       output when the execution character set is UTF-8 and it is clear how
>>       these cases should be handled in that case.
>>    2. But how would the behavior be defined when the execution
>>       character set is not UTF-8? Would the arguments be converted to the
>>       execution character set? Or to the locale dependent encoding?
>>    3. Note that these concerns are relevant for std::format() as well.
>>
>> An additional issue that was not discussed in SG16 relates to Unicode
>> normalization. As proposed, the expected output will match expectations if
>> the UTF-8 text does not contain any uses of combining characters. However,
>> if combining characters are present, either because the text is in NFD or
>> because there is no precomposed character defined, then the combining
>> characters may be rendered separately from their base character as a result
>> of terminal/console interfaces mapping code points rather than grapheme
>> clusters to columns. Should std::print() also perform NFC normalization
>> so that characters with precomposed forms are displayed correctly? (These
>> concerns were explored in P1868 <https://wg21.link/p1868> when it was
>> adopted for C++20; see that paper for example screenshots; in practice,
>> this is only an issue with the Windows console).
>>
>> It would not be unreasonable for LEWG to send some of these questions
>> back to SG16 for more analysis.
>>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18572.php
>

Received on 2021-04-26 10:52:41