On Mon, Apr 26, 2021 at 1:15 PM Tom Honermann <tom@honermann.net> wrote:

I don't think that test really exercises the concern Hubert raised. Please try this one instead (I don't have a convenient environment to try this myself). This one should print only the AM/PM designator for each of the locales. You might also try the "zh_CN" and "zh_HK" locales.

#include <chrono>
#include <format>
#include <iostream>

int main() {
    using namespace std::literals::chrono_literals;

    std::locale::global(std::locale("ja_JP"));
    std::cout << "ja_JP: " << std::format("{:%p}\n", std::chrono::system_clock::now().time_since_epoch());

    std::locale::global(std::locale("ja_JP.utf8"));
    std::cout << "ja_JP.utf8: " << std::format("{:%p}\n", std::chrono::system_clock::now().time_since_epoch());
}

I can, not surprisingly, produce mojibake on a Linux system when the locale is set to "zh_CN.gb18030" or "zh_HK.big5hkscs" (correct output is produced with "zh_CN.utf-8" and "zh_HK.utf-8"; I don't have any Japanese locales installed).

Tom.

On 4/26/21 11:52 AM, Victor Zverovich via Lib-Ext wrote:
Hi Hubert,

Thanks for an interesting example. I don't think it's fundamentally different from other cases of mixing encodings that we have already discussed, but I've tested it anyway:

#include <chrono>
#include <format>
#include <iostream>

int main() {
    using namespace std::literals::chrono_literals;
    std::locale::global(std::locale(".950"));
    std::cout << std::format("{:%r}\n", std::chrono::system_clock::now().time_since_epoch());
}

This produces the following output on Windows with 950 console code page:

C:\test>test
15:46:09

although there might be issues with other cases (as expected).

Moreover, because we are looking at the case when the literal encoding is UTF-8, you'll currently get mojibake even in pretty basic cases:

std::cout << std::format("时间 {:%r}\n", std::chrono::system_clock::now().time_since_epoch());

Output:

C:\test>test
?園 15:49:36

Cheers,

Victor

On Fri, Apr 16, 2021 at 10:31 AM Hubert Tong via Lib-Ext <lib-ext@lists.isocpp.org> wrote:

On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");

The following are questions/concerns that came up during SG16 review of P2093 that are worthy of further discussion in SG16 and/or LEWG. Most of these issues were discussed in SG16 and were determined either not to be SG16 concerns or were deemed issues for which we did not want to hold back forward progress. These sentiments were not unanimous.

The SG16 poll to forward P2093R3 was taken during our February 10th telecon. The poll was:

Poll: Forward P2093R3 to LEWG.
- Attendance: 9

SF  F  N  A  SA
 4  2  2  0   1

Minutes for prior SG16 reviews of P2093 are available at:

December 9th, 2020 telecon; review of P2093R2.

February 10th, 2021 telecon; review of P2093R3.

Questions raised include:

How should errors in transcoding be handled?
The Unicode recommendation is to substitute a replacement character for invalid code unit sequences. P2093R4 added wording to this effect.

Should this feature move forward without a parallel proposal to provide the underlying implementation-dependent features needed to implement std::print()?
Specifically, should this feature be blocked on exposing interfaces to 1) determine if a stream is connected directly to a terminal/console, and 2) write directly to a terminal/console (potentially bypassing a stream) using native interfaces where applicable? These features would be necessary in order to implement a portable version of std::print(). (I believe Victor is already working on a companion paper).

The choice to base behavior on the compile-time choice of execution character set results in locale settings being ignored at run-time. Is that ok?

This choice will lead to unexpected results if a program runs in a non-UTF-8 locale and consumes non-Unicode input (e.g., from stdin) and then attempts to echo it back.
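
[Editorial sketch, not part of the original message.] To make the compile-time nature of that choice concrete, here is one commonly used way to detect a UTF-8 literal encoding, similar in spirit to what the {fmt} library does. The function name literal_encoding_is_utf8 is hypothetical. The result is fixed when the program is compiled, so a later call to std::locale::global() or setlocale() cannot change it.

```cpp
// Hypothetical helper: true if ordinary string literals in this translation
// unit are encoded as UTF-8. U+00B5 (MICRO SIGN) is encoded as the two bytes
// 0xC2 0xB5 in UTF-8, so the array holds those bytes plus the terminator.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// The answer is a constant expression: it cannot depend on the run-time locale.
constexpr bool assume_utf8_output = literal_encoding_is_utf8();
```

A program compiled with a UTF-8 literal encoding would therefore take the Unicode path even when the user's locale encoding is, for example, Big5.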

Out of the meeting, we were asked to continue the discussion on-list (and also in SG16).

Regarding this point, the non-Unicode "input" can be the result of the formatting facility.

The conversation so far seems to indicate that the locales are not constrained to use UTF-8 even in modes where the encoding used for string literals is UTF-8.
That seems to indicate that something like:

std::print("{:%r}\n", std::chrono::system_clock::now().time_since_epoch());

which only uses the C++ library facilities in an attempt to present a localized string runs the risk of generating replacement characters.

For example, if
std::locale::global(std::locale(""));

was run when the environment had a non-UTF-8 locale.

For example, "下午" in Big5 could end up being "�U��".
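
[Editorial sketch, not part of the original message.] The result above can be reproduced by treating the Big5 bytes of "下午" (0xA4 0x55 0xA4 0xC8) as UTF-8 and substituting U+FFFD for whatever is not well-formed. The function names below are hypothetical. As a simplifying assumption, the sketch emits one replacement character per offending byte; the Unicode recommended practice (maximal subparts) can emit fewer for truncated multi-byte sequences, though both approaches agree for this input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if none starts there. The byte ranges follow the
// well-formed UTF-8 table in the Unicode Standard.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> int {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : -1;
    };
    auto in = [](int b, int lo, int hi) { return b >= lo && b <= hi; };

    int b0 = byte(i);
    if (in(b0, 0x00, 0x7F)) return 1;
    if (in(b0, 0xC2, 0xDF)) return in(byte(i + 1), 0x80, 0xBF) ? 2 : 0;
    if (in(b0, 0xE0, 0xEF)) {
        int lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // reject overlong forms
        int hi = (b0 == 0xED) ? 0x9F : 0xBF;  // reject surrogates
        return in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF) ? 3 : 0;
    }
    if (in(b0, 0xF0, 0xF4)) {
        int lo = (b0 == 0xF0) ? 0x90 : 0x80;  // reject overlong forms
        int hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // reject values above U+10FFFF
        return in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF) &&
               in(byte(i + 3), 0x80, 0xBF) ? 4 : 0;
    }
    return 0;  // lone continuation byte or invalid lead byte
}

// Hypothetical helper: copies well-formed sequences unchanged and replaces
// every other byte with U+FFFD (encoded in UTF-8 as EF BF BD).
std::string replace_invalid_utf8(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = well_formed_length(s, i);
        if (n == 0) {
            out += "\xEF\xBF\xBD";
            ++i;
        } else {
            out.append(s.substr(i, n));
            i += n;
        }
    }
    return out;
}
```

Calling replace_invalid_utf8("\xA4\x55\xA4\xC8") yields U+FFFD, 'U', U+FFFD, U+FFFD: only the byte 0x55, which happens to be ASCII 'U', survives.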

Additionally, it means that a program that uses only ASCII characters in string literals will nevertheless behave differently at run-time depending on the choice of execution character set (which historically has only affected the encoding of string literals).

When the execution character set is not UTF-8, should conversion to Unicode be performed when writing directly to a Unicode enabled terminal/console?

If so, should conversions be based on the compile-time literal encoding or the locale dependent run-time execution encoding?

If the latter, that creates an odd asymmetry with the behavior when the execution character set is UTF-8. Is that ok?

What are the implications for future support of std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?

As proposed, std::print() only produces unambiguously encoded output when the execution character set is UTF-8 and it is clear how these cases should be handled in that case.

But how would the behavior be defined when the execution character set is not UTF-8? Would the arguments be converted to the execution character set? Or to the locale dependent encoding?

Note that these concerns are relevant for std::format() as well.

An additional issue that was not discussed in SG16 relates to Unicode normalization. As proposed, the output will match expectations if the UTF-8 text does not contain any combining characters. However, if combining characters are present, either because the text is in NFD or because there is no precomposed character defined, then the combining characters may be rendered separately from their base character as a result of terminal/console interfaces mapping code points rather than grapheme clusters to columns. Should std::print() also perform NFC normalization so that characters with precomposed forms are displayed correctly? (These concerns were explored in P1868 when it was adopted for C++20; see that paper for example screenshots; in practice, this is only an issue with the Windows console).

It would not be unreasonable for LEWG to send some of these questions back to SG16 for more analysis.

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16

_______________________________________________
Lib-Ext mailing list
Lib-Ext@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18572.php
Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18674.php