[SG16] Agenda for the 2021-05-12 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 4 May 2021 00:06:16 -0400
SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone
conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

The agenda is:

  * D2372R1: Fixing locale handling in chrono formatters
    <https://isocpp.org/files/papers/D2372R1.html>
      o Affirm or rebut LEWG's position.
  * P2093R5: Formatted output <https://wg21.link/p2093r5>
      o Discuss locale dependent character encoding concerns.
  * P2295R2: Support for UTF-8 as a portable source file encoding
    <https://wg21.link/p2295r3>
      o Review updates intended to address prior SG16 feedback.
  * P2348R0: Whitespaces Wording Revamp <https://wg21.link/p2348r0>

Our last telecon was consumed by discussion
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
of LWG3547 <https://cplusplus.github.io/LWG/issue3547> and possible
remedies. Though we did not reach consensus on a direction forward
during that telecon, Victor and Corentin, at the LEWG chair's request,
drafted D2372R0, presented it at the LEWG telecon held 2021-05-03
<https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>, and
LEWG reached strong consensus for it. The D2372R0 revision will be
submitted for the May mailing as P2372R0, and a D2372R1
<https://isocpp.org/files/papers/D2372R1.html> revision addressing LEWG
feedback will be submitted as P2372R1. Both revisions substantially
match the proposed resolution that SG16 discussed. Since SG16 did not
reach consensus on that direction, the LEWG chair has asked that we
revisit it to either affirm or rebut the LEWG consensus. We will
therefore (briefly) discuss and then poll that direction. Note that the
poll taken in SG16 differs from the poll taken in LEWG. In SG16, we
polled applying the proposed resolution to C++23 while LEWG polled
applying the proposed resolution (with amendments to not change behavior
for iostream manipulators) to C++23 *and* retroactively to C++20.

Once we've dispatched D2372R1, we'll return to the original agenda for
our last telecon: discussion of P2093R5 <https://wg21.link/p2093r5>
(Formatted output) and P2295R2 <https://wg21.link/p2295r3> (Support for
UTF-8 as a portable source file encoding). I've retained P2348R0
<https://wg21.link/p2348r0> on the agenda, though I don't expect that
we'll get to it.

With regard to P2093R5 <https://wg21.link/p2093r5>, the current status
is that LEWG has referred the paper back to SG16 for further discussion;
please see the LEWG meeting minutes here
<https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
Specifically, LEWG would benefit from additional analysis of previously
deferred questions <http://lists.isocpp.org/lib-ext/2021/03/18189.php>
regarding character encoding concerns, transcoding requirements (or the
lack thereof) and the ensuing consequences (or lack thereof).

 1. How errors in transcoding should be handled. E.g., when transcoding
    from UTF-8 to a UTF-16 based console interface and the UTF-8 input
    is not well-formed.
 2. The choice to base behavior on the compile-time choice of literal
    encoding. An implication of the current proposal is that a program
    that contains only ASCII characters in string literals will change
    behavior depending on whether the literal encoding is UTF-8 vs ASCII
    (or some other ASCII derived encoding).
 3. Whether transcoding to the console interface encoding should be
    performed when the literal encoding is not UTF-8.
 4. What the implications are for future support of std::print("{} {}
    {}{}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text").
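
To make the first question concrete, here is a minimal sketch of one
possible error policy: transcode UTF-8 to UTF-16 and substitute U+FFFD
for each byte that cannot start or complete a well-formed sequence. The
function name utf8_to_utf16_lossy is hypothetical and nothing here is
taken from P2093R5. Other policies (throwing, stopping at the first
error, or Unicode's "maximal subpart" substitution, which emits fewer
replacement characters for truncated sequences) are equally plausible
candidates.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, for illustration only (not part of P2093R5).
// Policy assumed here: emit one U+FFFD per rejected byte and resume
// decoding at the next byte.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;    // code point being assembled
        char32_t min = 0;   // smallest value allowed for this length
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        // len == 0 means b0 is not a valid lead byte.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Under this policy the ill-formed input "a\xFFz" comes out as 'a',
U+FFFD, 'z'. Whether silent substitution like this is acceptable for
console output, or whether the error should be reported, is the
question to settle.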

I think these concerns will be easier to resolve if we first reach
consensus regarding scenarios in which localized text may be provided in
an unexpected encoding. The following is a slightly modified example of
code Hubert previously provided. The example has been modified to
explicitly opt into localized chrono formatting.

    std::print("{:L%p}\n",
               std::chrono::system_clock::now().time_since_epoch());

At issue is the encoding used by locale sensitive chrono formatters.
The example above contains the %p specifier and is locale sensitive
because AM/PM designations may be localized. In a Chinese locale the
desired translation of "PM" is "下午", but the locale will provide the
translation in the locale encoding. As specified in P2093R5, if the
literal encoding is UTF-8, then std::print() will expect the translation
to be provided in UTF-8, but if the locale is not UTF-8-based (e.g.,
Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the
result is mojibake.
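
For readers unsure what "the literal encoding is UTF-8" means in
practice, the following is a rough sketch of how a library might detect
it at compile time by probing how a non-ASCII character is stored in an
ordinary string literal. The names is_utf8_micro_sign and
literal_encoding_is_utf8 are hypothetical, and this probe is an assumed
heuristic rather than something P2093R5 specifies.

```cpp
#include <cstddef>

// Hypothetical probe, for illustration only. U+00B5 MICRO SIGN is
// encoded as the two bytes 0xC2 0xB5 in UTF-8, so a literal holding
// exactly those bytes plus the terminating NUL suggests UTF-8.
template <std::size_t N>
constexpr bool is_utf8_micro_sign(const char (&probe)[N]) {
    return N == 3 &&
           static_cast<unsigned char>(probe[0]) == 0xC2 &&
           static_cast<unsigned char>(probe[1]) == 0xB5;
}

// The answer is fixed when the translation unit is compiled (e.g. by
// MSVC's /utf-8 option) and does not depend on the run-time locale.
// Assumes the literal encoding can represent U+00B5 at all.
constexpr bool literal_encoding_is_utf8() {
    return is_utf8_micro_sign("\u00B5");
}
```

Because the result is a compile-time constant, it cannot reflect the
encoding of the locale that is active when the program runs, and that
gap is where the mismatch described above comes from.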

I had previously suggested the following possible directions we can
investigate to resolve the encoding concerns.

  * Specialize std::locale facets
    <https://en.cppreference.com/w/cpp/locale/locale> and related I/O
    manipulators like std::put_time()
    <https://en.cppreference.com/w/cpp/io/manip/put_time> for char8_t.
    This would allow std::print() to, when the literal encoding is
    UTF-8, opt-in to use of the UTF-8/char8_t facets and I/O manipulators.
  * When the literal encoding is UTF-8, stipulate that running the
    program in a non-UTF-8 based locale is non-conforming. This would
    effectively require MSVC programmers, when building code with the
    /utf-8 option, to also force selection of a UTF-8 code page via a
    manifest
    <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
    and require use of Windows 10 build 1903 or later.
  * When the literal encoding is UTF-8, specify that non-UTF-8 based
    locale dependent translations be implicitly transcoded (such
    transcoding should never result in errors except perhaps for memory
    allocation failures).
  * Drop the special case handling for the literal encoding being UTF-8
    and specify that, when bypassing a stream to write directly to the
    console, the output be implicitly transcoded from the current
    locale dependent encoding (whatever it is) to the console encoding
    (UTF-8).

If we get through all of that, we'll review Corentin's updates in
P2295R2 <https://wg21.link/p2295r3> to address prior SG16 feedback.
Thank you to everyone who already provided additional feedback on the
mailing list!

Tom.


Received on 2021-05-03 23:06:21