Date: Tue, 11 May 2021 23:23:00 -0400
On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16 <
sg16_at_[hidden]> wrote:
> Dear Unicoders,
>
> Here is a link to a new revision of P2093:
> https://isocpp.org/files/papers/D2093R6.html. It's essentially the same
> as R5 but addresses the latest LEWG feedback and adds a few clarifications.
> The only change to the wording is replacing <io> with <print>.
>
Thanks Victor.
With respect to the choices around transcoding, it took me a while to catch
on to the statement being made. I think it would help if the paper stated
more explicitly that replacement is performed during transcoding because
that is consistent with the treatment of malformed UTF-8 by UTF-8-native
terminals, and that no transcoding is done when the terminal is UTF-8 native
because we expect the terminal to behave predictably, as if we did do the
"transcoding".
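To make the "replacement during transcoding" point concrete, here is a
minimal sketch of a lossy UTF-8 to UTF-16 conversion. The function name
utf8_to_utf16_lossy is hypothetical and is not taken from P2093; the policy
of emitting one U+FFFD per offending byte is an assumption chosen for
simplicity, and a real implementation or terminal may group bytes
differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2093): transcodes UTF-8 to UTF-16 and
// substitutes U+FFFD for every byte that is not part of a well-formed
// sequence, instead of reporting an error.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t min = 0;     // smallest code point allowed for this length
        std::size_t len = 0;  // 0 means "b0 cannot start a sequence"
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char bk = static_cast<unsigned char>(in[i + k]);
            ok = (bk & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (bk & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (!ok) {
            out.push_back(u'\uFFFD');  // assumed policy: one per bad byte
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With this sketch, the malformed input "a\xFFz" comes out as u"a\uFFFDz",
which is the same kind of result a UTF-8-native terminal would be expected
to show for the untranscoded bytes.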
I'm still not entirely convinced about the argument surrounding the choice
of using the literal encoding though. The paper can at least acknowledge
that "polyglot" string literals exist and that they partially obviate the
insistence that, because the literal encoding is UTF-8 according to the
build system/build mode, the involvement of non-UTF-8 strings in the
vicinity of std::print constitutes "mixing encodings".
I really think that, just for predictability surrounding the display of
substitution text, we'll end up with cases where the literal encoding is
UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially
kick in.
At least two cases come to mind:
(1) Printing using both legacy interfaces and std::print where the legacy
interfaces are not using UTF-8 may appear fine on some terminals but would
result, on redirect, in output with mixed encoding.
(2) With std::print where the literal encoding is UTF-8 but the literals are
all "polyglot", substitution strings that are not UTF-8 can appear to be
okay when redirecting or printing to non-Unicode terminals; however, once
deployed to a Unicode terminal, replacement characters show up (even if the
output is properly encoded for the underlying C output interface).
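Case (2) can be sketched as a classification of the byte strings involved.
This reads "polyglot" as a literal whose bytes are the same under UTF-8 and
under an ASCII-compatible legacy encoding, which is an interpretation. The
names Kind and classify are hypothetical, the check is structural only
(lead and continuation byte patterns; overlong forms and surrogates are not
rejected), and the bytes 0xA4 0x55 merely stand in for text in a legacy
double-byte encoding without being asserted to be any particular character.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical classification of a narrow string's bytes.
enum class Kind { ascii_polyglot, utf8, not_utf8 };

// Structural check only: verifies lead/continuation byte patterns.
Kind classify(std::string_view s) {
    bool ascii_only = true;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }  // ASCII byte
        ascii_only = false;
        std::size_t len = (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3
                        : (b & 0xF8) == 0xF0 ? 4
                        : 0;              // not a valid lead byte
        if (len == 0 || i + len > s.size()) return Kind::not_utf8;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return Kind::not_utf8;
        }
        i += len;
    }
    return ascii_only ? Kind::ascii_polyglot : Kind::utf8;
}
```

A format string such as "{} {}\n" classifies as ascii_polyglot regardless of
whether the build says the literal encoding is UTF-8, while a substitution
string holding the bytes 0xA4 0x55 classifies as not_utf8. The output looks
fine when the bytes pass through untouched, and only a Unicode terminal path
that validates them turns them into replacement characters.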
>
> Cheers,
> Victor
>
> On Tue, May 11, 2021 at 11:02 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Reminder that this meeting is taking place tomorrow.
>>
>> Per suggestion by Peter, the agenda order is being changed to review the
>> updates in P2295R2 before D2372R1 and P2093R5 in the hopes that we can
>> forward P2295R2 to EWG. We'll try to limit that discussion to 30 minutes.
>> The updated agenda is below. Again, we are unlikely to get to P2348R0 at
>> all.
>>
>> - P2295R2: Support for UTF-8 as a portable source file encoding
>> <https://wg21.link/p2295r3>
>> - Review updates intended to address prior SG16 feedback.
>> - D2372R1: Fixing locale handling in chrono formatters
>> <https://isocpp.org/files/papers/D2372R1.html>
>> - Affirm or rebut LEWG's position.
>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>> - Discuss locale dependent character encoding concerns.
>> - P2348R0: Whitespaces Wording Revamp <https://wg21.link/p2348r0>
>>
>> Tom.
>>
>> On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:
>>
>> SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone
>> conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>> ).
>>
>> The agenda is:
>>
>> - D2372R1: Fixing locale handling in chrono formatters
>> <https://isocpp.org/files/papers/D2372R1.html>
>> - Affirm or rebut LEWG's position.
>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>> - Discuss locale dependent character encoding concerns.
>> - P2295R2: Support for UTF-8 as a portable source file encoding
>> <https://wg21.link/p2295r3>
>> - Review updates intended to address prior SG16 feedback.
>> - P2348R0: Whitespaces Wording Revamp <https://wg21.link/p2348r0>
>>
>> Our last telecon was consumed by discussion
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
>> of LWG3547 <https://cplusplus.github.io/LWG/issue3547> and possible
>> remedies. Though we did not reach consensus on a direction forward during
>> that telecon, Victor and Corentin, at the LEWG chair's request, drafted
>> D2372R0, presented it at the LEWG telecon held 2021-05-03
>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>, and
>> LEWG reached strong consensus for it. The D2372R0 revision will be
>> submitted for the May mailing as P2372R0; and a D2372R1
>> <https://isocpp.org/files/papers/D2372R1.html> revision addressing LEWG
>> feedback will be submitted as P2372R1. Both revisions substantially match
>> the proposed resolution that SG16 discussed. Since SG16 did not reach
>> consensus on that direction, the LEWG chair has asked that we revisit it to
>> either affirm or rebut the LEWG consensus. We will therefore (briefly)
>> discuss and then poll that direction. Note that the poll taken in SG16
>> differs from the poll taken in LEWG. In SG16, we polled applying the
>> proposed resolution to C++23 while LEWG polled applying the proposed
>> resolution (with amendments to not change behavior for iostream
>> manipulators) to C++23 *and* retroactively to C++20.
>>
>> Once we've dispatched D2372R1, we'll return to the original agenda for
>> our last telecon; discussion of P2093R5 <https://wg21.link/p2093r5>
>> (Formatted output) and P2295R2 <https://wg21.link/p2295r3> (Support for
>> UTF-8 as a portable source file encoding). I've retained P2348R0
>> <https://wg21.link/p2348r0> on the agenda, though I don't expect that
>> we'll get to it.
>>
>> With regard to P2093R5 <https://wg21.link/p2093r5>, the current status
>> is that LEWG has referred the paper back to SG16 for further discussion;
>> please see the LEWG meeting minutes here
>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
>> Specifically, LEWG would benefit from additional analysis of previously
>> deferred questions <http://lists.isocpp.org/lib-ext/2021/03/18189.php>
>> regarding character encoding concerns, transcoding requirements (or the
>> lack thereof) and the ensuing consequences (or lack thereof).
>>
>> 1. How errors in transcoding should be handled. E.g., when
>> transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8
>> input is not well-formed.
>> 2. The choice to base behavior on the compile-time choice of literal
>> encoding. An implication of the current proposal is that a program that
>> contains only ASCII characters in string literals will change behavior
>> depending on whether the literal encoding is UTF-8 vs ASCII (or some other
>> ASCII derived encoding).
>> 3. Whether transcoding to the console interface encoding should be
>> performed when the literal encoding is not UTF-8.
>> 4. What the implications are for future support of std::print("{} {}
>> {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text").
>>
>> I think these concerns will be easier to resolve if we first reach
>> consensus regarding scenarios in which localized text may be provided in an
>> unexpected encoding. The following is a slightly modified example of code
>> Hubert previously provided. The example has been modified to explicitly
>> opt into localized chrono formatting.
>>
>> std::print("{:L%p}\n",
>> std::chrono::system_clock::now().time_since_epoch());
>>
>> At issue is the encoding used by locale sensitive chrono formatters. The
>> example above contains the %p specifier and is locale sensitive because
>> AM/PM designations may be localized. In a Chinese locale the desired
>> translation of "PM" is "下午", but the locale will provide the translation in
>> the locale encoding. As specified in P2093R5, if the literal encoding is
>> UTF-8, then std::print() will expect the translation to be provided in
>> UTF-8, but if the locale is not UTF-8-based (e.g., Big5; perhaps Shift-JIS
>> for the Japanese 午後 translation), then the result is mojibake.
>>
>> I had previously suggested the following possible directions we can
>> investigate to resolve the encoding concerns.
>>
>> - Specialize std::locale facets
>> <https://en.cppreference.com/w/cpp/locale/locale> and related I/O
>> manipulators like std::put_time()
>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for char8_t.
>> This would allow std::print() to, when the literal encoding is UTF-8,
>> opt-in to use of the UTF-8/char8_t facets and I/O manipulators.
>> - When the literal encoding is UTF-8, stipulate that running the
>> program in a non-UTF-8 based locale is non-conforming. This would
>> effectively require MSVC programmers, when building code with the
>> /utf-8 option, to also force selection of a UTF-8 code page via a
>> manifest
>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>> and require use of Windows 10 build 1903 or later.
>> - When the literal encoding is UTF-8, specify that non-UTF-8 based
>> locale dependent translations be implicitly transcoded (such transcoding
>> should never result in errors except perhaps for memory allocation
>> failures).
>> - Drop the special case handling for the literal encoding being UTF-8
>> and specify that, when bypassing a stream to write directly to the console,
>> the output be implicitly transcoded from the current locale dependent
>> encoding (whatever it is) to the console encoding (UTF-8).
>>
>> If we get through all of that, we'll review Corentin's updates in P2295R2
>> <https://wg21.link/p2295r3> to address prior SG16 feedback. Thank you
>> to everyone that already provided additional feedback on the mailing list!
>>
>> Tom.
>>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-05-11 22:23:30