Date: Mon, 26 Apr 2021 16:15:32 -0400
I don't think that test really exercises the concern Hubert raised.
Please try this one instead (I don't have a convenient environment to
try this myself). This one should print only the AM/PM designator for
each of the locales. You might also try the "zh_CN" and "zh_HK" locales.
#include <chrono>
#include <format>
#include <iostream>
#include <locale>

int main() {
    using namespace std::literals::chrono_literals;
    // Locale named without an explicit encoding.
    std::locale::global(std::locale("ja_JP"));
    std::cout << "ja_JP: " << std::format("{:%p}\n",
        std::chrono::system_clock::now().time_since_epoch());
    // Same language and territory, explicitly UTF-8.
    std::locale::global(std::locale("ja_JP.utf8"));
    std::cout << "ja_JP.utf8: " << std::format("{:%p}\n",
        std::chrono::system_clock::now().time_since_epoch());
}
I can, not surprisingly, produce mojibake on a Linux system when the
locale is set to "zh_CN.gb18030" or "zh_HK.big5hkscs" (correct output is
produced with "zh_CN.utf-8" and "zh_HK.utf-8"; I don't have any Japanese
locales installed).
Tom.
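
A possible aid when running the test above: a terminal can hide or
further mangle the bytes it is given, so it may help to dump the
formatted string as hex and compare the bytes directly. This is only a
minimal sketch; to_hex is a hypothetical helper, not something from the
standard library or from P2093.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: render each byte of `s` as two lowercase hex
// digits separated by spaces, so the encoding can be inspected
// independently of how the terminal renders it.
std::string to_hex(std::string_view s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}
```

One could then print to_hex(std::format("{:%p}", ...)) for each locale.
For "下午" I would expect e4 b8 8b e5 8d 88 from a UTF-8 locale and,
assuming the usual Big5 mapping, a4 55 a4 c8 from a Big5 locale. Also
worth checking: in standard library implementations that adopted P2372,
the localized designator is, as far as I know, only produced with the L
option, i.e. "{:L%p}".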
On 4/26/21 11:52 AM, Victor Zverovich via Lib-Ext wrote:
> Hi Hubert,
>
> Thanks for an interesting example. I don't think it's fundamentally
> different from other cases of mixing encodings that we have already
> discussed, but I've tested it anyway:
>
> #include <chrono>
> #include <format>
> #include <iostream>
> #include <locale>
>
> int main() {
>     using namespace std::literals::chrono_literals;
>     std::locale::global(std::locale(".950"));
>     std::cout << std::format("{:%r}\n",
>         std::chrono::system_clock::now().time_since_epoch());
> }
>
> This produces the following output on Windows with 950 console code page:
>
> C:\test>test
> 15:46:09
>
> although there might be issues with other cases (as expected).
>
> Moreover, because we are looking at the case when the literal encoding
> is UTF-8, you'll currently get mojibake even in pretty basic cases:
>
> std::cout << std::format("时间 {:%r}\n",
> std::chrono::system_clock::now().time_since_epoch());
>
> Output:
>
> C:\test>test
> ?園 15:49:36
>
> Cheers,
> Victor
>
>
> On Fri, Apr 16, 2021 at 10:31 AM Hubert Tong via Lib-Ext
> <lib-ext_at_[hidden] <mailto:lib-ext_at_[hidden]>> wrote:
>
> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>
> The following are questions/concerns that came up during SG16
> review of P2093 <https://wg21.link/p2093> that are worthy of
> further discussion in SG16 and/or LEWG. Most of these issues
> were discussed in SG16 and were determined either not to be
> SG16 concerns or were deemed issues for which we did not
> want to hold back forward progress. These sentiments were not
> unanimous.
>
> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3>
> was taken during our February 10th telecon. The poll was:
>
> Poll: Forward P2093R3 to LEWG.
> - Attendance: 9
>
> SF  F  N  A  SA
>  4  2  2  0   1
>
> Minutes for prior SG16 reviews of P2093
> <https://wg21.link/p2093> are available at:
>
> * December 9th, 2020 telecon
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
> review of P2093R2 <https://wg21.link/p2093r2>.
> * February 10th, 2021 telecon
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
> review of P2093R3 <https://wg21.link/p2093r3>.
>
> Questions raised include:
>
> 1. How should errors in transcoding be handled?
> The Unicode recommendation is to substitute a replacement
> character for invalid code unit sequences. P2093R4
> <https://wg21.link/p2093r4> added wording to this effect.
> 2. Should this feature move forward without a parallel
> proposal to provide the underlying implementation
> dependent features needed to implement std::print()?
> Specifically, should this feature be blocked on exposing
> interfaces to 1) determine if a stream is connected
> directly to a terminal/console, and 2) write directly to a
> terminal/console (potentially bypassing a stream) using
> native interfaces where applicable? These features would
> be necessary in order to implement a portable version of
> std::print(). (I believe Victor is already working on a
> companion paper).
> 3. The choice to base behavior on the compile-time choice of
> execution character set results in locale settings being
> ignored at run-time. Is that ok?
> 1. This choice will lead to unexpected results if a
> program runs in a non-UTF-8 locale and consumes
> non-Unicode input (e.g., from stdin) and then attempts
> to echo it back.
>
>
> Out of the meeting, we were asked to continue the discussion
> on-list (and also in SG16).
>
> Regarding this point, the non-Unicode "input" can be the result of
> the formatting facility.
>
> The conversation so far seems to indicate that the locales are not
> constrained to use UTF-8 even in modes where the encoding used for
> string literals is UTF-8.
> That seems to indicate that something like:
>
> std::print("{:%r}\n",
> std::chrono::system_clock::now().time_since_epoch());
>
> which only uses the C++ library facilities in an attempt to
> present a localized string runs the risk of generating replacement
> characters.
>
> For example, if
> std::locale::global(std::locale(""));
>
> was run when the environment had a non-UTF-8 locale.
>
> For example, "下午" in Big5 could end up being "�U��".
>
> 2. Additionally, it means that a program that uses only
> ASCII characters in string literals will nevertheless
> behave differently at run-time depending on the choice
> of execution character set (which historically has
> only affected the encoding of string literals).
> 4. When the execution character set is not UTF-8, should
> conversion to Unicode be performed when writing directly
> to a Unicode enabled terminal/console?
> 1. If so, should conversions be based on the compile-time
> literal encoding or the locale dependent run-time
> execution encoding?
> 2. If the latter, that creates an odd asymmetry with the
> behavior when the execution character set is UTF-8.
> Is that ok?
> 5. What are the implications for future support of
> std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text",
> u"UTF-16 text", U"UTF-32 text")?
> 1. As proposed, std::print() only produces unambiguously
> encoded output when the execution character set is
> UTF-8 and it is clear how these cases should be
> handled in that case.
> 2. But how would the behavior be defined when the
> execution character set is not UTF-8? Would the
> arguments be converted to the execution character
> set? Or to the locale dependent encoding?
> 3. Note that these concerns are relevant for
> std::format() as well.
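
The questions in the list above about basing behavior on the
compile-time literal encoding can be made concrete with a small probe.
One way to detect a UTF-8 literal encoding is to look at the bytes the
compiler produces for a non-ASCII character in an ordinary string
literal. This is a sketch of that idea only; the function name is
hypothetical, and it is not claimed to be the wording or the
implementation proposed in P2093.

```cpp
// Returns true if ordinary string literals appear to be UTF-8 encoded.
// U+00B5 (MICRO SIGN) is the two bytes C2 B5 in UTF-8; other literal
// encodings yield a different byte sequence. (Assumption: the literal
// encoding can represent U+00B5 at all.)
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}
```

The result is fixed when the program is compiled, so a later call to
std::locale::global() cannot change it. That is exactly why the run-time
locale ends up being ignored in the scenario described above.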
>
> An additional issue that was not discussed in SG16 relates to
> Unicode normalization. As proposed, the output will match
> expectations if the UTF-8 text does not contain any uses
> of combining characters. However, if combining characters are
> present, either because the text is in NFD or because there is
> no precomposed character defined, then the combining
> characters may be rendered separately from their base
> character as a result of terminal/console interfaces mapping
> code points rather than grapheme clusters to columns. Should
> std::print() also perform NFC normalization so that characters
> with precomposed forms are displayed correctly? (These
> concerns were explored in P1868 <https://wg21.link/p1868> when
> it was adopted for C++20; see that paper for example
> screenshots; in practice, this is only an issue with the
> Windows console).
>
> It would not be unreasonable for LEWG to send some of these
> questions back to SG16 for more analysis.
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden] <mailto:Lib-Ext_at_[hidden]>
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18572.php
>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/04/18674.php
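
To illustrate Hubert's Big5 example from the thread above: if bytes that
are not UTF-8 are treated as UTF-8 and ill-formed input is replaced with
U+FFFD, "下午" in Big5 (a4 55 a4 c8, assuming the usual mapping) comes
out as three replacement characters around a "U". The following is a
minimal sketch under stated assumptions: replace_invalid_utf8 is a
hypothetical name, and it substitutes one U+FFFD per offending byte,
which is a simplification and can differ from the Unicode recommended
handling of truncated multi-byte sequences.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch: copy well-formed UTF-8 sequences from `in` unchanged and emit
// U+FFFD (EF BF BD) for each byte that does not start one.
std::string replace_invalid_utf8(std::string_view in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = in[i];
        std::size_t len = 0;  // 0 means "not a valid lead byte"
        if (b0 < 0x80) len = 1;
        else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 >= 0xE0 && b0 <= 0xEF) len = 3;
        else if (b0 >= 0xF0 && b0 <= 0xF4) len = 4;
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char b = in[i + k];
            unsigned char lo = 0x80, hi = 0xBF;
            if (k == 1) {  // exclude overlong forms, surrogates, > U+10FFFF
                if (b0 == 0xE0) lo = 0xA0;
                else if (b0 == 0xED) hi = 0x9F;
                else if (b0 == 0xF0) lo = 0x90;
                else if (b0 == 0xF4) hi = 0x8F;
            }
            ok = b >= lo && b <= hi;
        }
        if (ok) {
            out.append(in.substr(i, len));
            i += len;
        } else {
            out += "\xEF\xBF\xBD";
            i += 1;
        }
    }
    return out;
}
```

Passing the four Big5 bytes through this function yields U+FFFD, "U",
U+FFFD, U+FFFD, which matches the "�U��" shown in the example, while
text that is already valid UTF-8 passes through unchanged.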
Received on 2021-04-26 15:15:38