sg16: [SG16] Agenda for the 2021-05-26 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 16 May 2021 17:23:26 -0400

SG16 will hold a telecon on Wednesday, May 26th at 19:30 UTC (timezone
conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20210526T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

The agenda is:

  * P2295R3: Support for UTF-8 as a portable source file encoding
    <https://wg21.link/p2295r3>
      o Review updates intended to address prior SG16 feedback.
  * P2093R6: Formatted output <https://wg21.link/p2093r6>
      o Discuss locale dependent character encoding concerns.

Since we did not get to discuss P2295R3 at our last telecon, it will
again retain the top spot on the agenda followed by P2093R6. Thus, the
agenda looks much the same as for the last telecon (I dropped P2348R0
<https://wg21.link/p2348r0> for now; we won't realistically get to it).

With regard to P2093R6 <https://wg21.link/p2093r6>, the current status
is unchanged; LEWG has referred the paper back to SG16 for further
discussion; please see the LEWG meeting minutes here
<https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
Specifically, LEWG would benefit from additional analysis of previously
deferred questions <http://lists.isocpp.org/lib-ext/2021/03/18189.php>
regarding character encoding concerns, transcoding requirements (or the
lack thereof), and the ensuing consequences (or lack thereof).

1. How errors in transcoding should be handled. E.g., when transcoding
    from UTF-8 to a UTF-16 based console interface and the UTF-8 input
    is not well-formed.
2. The choice to base behavior on the compile-time choice of literal
    encoding. An implication of the current proposal is that a program
    that contains only ASCII characters in string literals will change
    behavior depending on whether the literal encoding is UTF-8 vs ASCII
    (or some other ASCII derived encoding).
3. Whether transcoding to the console interface encoding should be
    performed when the literal encoding is not UTF-8.
4. What the implications are for future support of
    std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text",
    U"UTF-32 text").
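
As a minimal sketch of one possible answer to question 1, the following
transcodes UTF-8 to UTF-16 and substitutes U+FFFD for input that is not
well-formed. The function name utf8_to_utf16_lossy and the policy of emitting
one U+FFFD per offending byte are assumptions made for illustration, not taken
from P2093R6; the policy is simpler than, and can differ from, the substitution
practice recommended by the Unicode Standard. An implementation could equally
throw or stop at the first error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of any proposal): transcode UTF-8 to UTF-16,
// emitting U+FFFD for each byte that is not part of a well-formed sequence.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // decoded code point
        char32_t min = 0;     // smallest code point allowed for this length
        std::size_t len = 0;  // sequence length; 0 means invalid lead byte
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        assert(len <= 4);

        // The sequence must fit in the input and use continuation bytes.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            out.push_back(u'\uFFFD');  // replacement policy (an assumption)
            ++i;                       // resynchronize at the next byte
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Whichever policy is chosen, the caller of a UTF-16 console interface never
sees ill-formed code units under this approach; the cost is that the original
bytes are not recoverable from the output.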

At our last telecon, we focused on how to handle ill-formed inputs, but
did not discuss in much depth how such inputs arise. Now that LWG3547
<https://cplusplus.github.io/LWG/issue3547> has been effectively (though
not officially) resolved by P2372R1 <https://wg21.link/p2372r1>, we have
a concrete example of how the std::print() facility itself can produce
ill-formed input (assuming that std::print() transcodes all inputs using
the same encoding). I would like to start with this example as I think
it is fundamental to how we choose to answer the above questions.

    std::print("{:L%p}\n",
               std::chrono::system_clock::now().time_since_epoch());

At issue is the encoding used by chrono formatters specified with the L
option to request a locale specific form. The example above contains the
%p specifier with the L option. In a Chinese locale the desired
translation of "PM" is "下午", but the locale will provide the translation
in the locale encoding. As specified in P2093R6, if the literal
encoding is UTF-8, then std::print() will expect the translation to be
provided in UTF-8, but if the locale is not UTF-8-based (e.g., Big5;
perhaps Shift-JIS for the Japanese 午後 translation), then the result is
mojibake.
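
To make the failure mode concrete, here is a minimal sketch that checks
whether a byte sequence is well-formed UTF-8 and applies it to "下午" in two
encodings. The helper name is_well_formed_utf8 is hypothetical, and the Big5
byte values are an assumption that should be checked against a Big5 table; the
sketch only relies on the fact that a sequence starting with the byte 0xA4
cannot be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        char32_t cp = 0, min = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// "下午" (U+4E0B U+5348) encoded as UTF-8.
constexpr std::string_view pm_utf8 = "\xE4\xB8\x8B\xE5\x8D\x88";
// "下午" as a Big5 locale might provide it (byte values assumed).
constexpr std::string_view pm_big5 = "\xA4\x55\xA4\xC8";
```

If std::print() treats every argument as UTF-8, a check like this passes for
pm_utf8 and fails for pm_big5, which is the situation the %p example can
produce.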

These are possible directions we can investigate to resolve the encoding
concerns.

  * Specialize std::locale facets
    <https://en.cppreference.com/w/cpp/locale/locale> and related I/O
    manipulators like std::put_time()
    <https://en.cppreference.com/w/cpp/io/manip/put_time> for char8_t.
    This would allow std::print() to, when the literal encoding is
    UTF-8, opt-in to use of the UTF-8/char8_t facets and I/O manipulators.
  * When the literal encoding is UTF-8, stipulate that running the
    program in a non-UTF-8 based locale is non-conforming. This would
    effectively require MSVC programmers, when building code with the
    /utf-8 option, to also force selection of a UTF-8 code page via a
    manifest
    <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
    and require use of Windows 10 version 1903 or later.
  * When the literal encoding is UTF-8, specify that non-UTF-8 based
    locale dependent translations be implicitly transcoded (such
    transcoding should never result in errors except perhaps for memory
    allocation failures).
  * Drop the special case handling for the literal encoding being UTF-8
    and specify that, when bypassing a stream to write directly to the
    console, the output be implicitly transcoded from the current
    locale dependent encoding (whatever it is) to the console encoding
    (UTF-8).
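
The implicit transcoding direction can be sketched as follows. ISO-8859-1
stands in for the locale encoding here only because its conversion to UTF-8 is
trivial and cannot fail; a real implementation would need a converter for
whatever encoding the locale actually uses. The names latin1_to_utf8 and
compose_utf8 are hypothetical and do not come from P2093R6.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Stand-in for "transcode from the locale encoding to UTF-8".
// Assumes the locale encoding is ISO-8859-1, where every byte maps
// directly to the code point with the same value.
std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(in.size() * 2);  // each byte becomes at most two bytes
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Hypothetical composition step: literal text is already UTF-8, while the
// locale-provided fragment is transcoded before being appended.
std::string compose_utf8(std::string_view utf8_literal_text,
                         std::string_view locale_fragment) {
    std::string out(utf8_literal_text);
    out += latin1_to_utf8(locale_fragment);
    return out;
}
```

For example, a locale-provided "M\xE4rz" (the month name März in ISO-8859-1)
becomes "M\xC3\xA4rz" before it is combined with UTF-8 literal text, so the
combined output stays well-formed.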

Tom.

Received on 2021-05-16 16:23:31