sg16: [SG16] Questions for LEWG for P2093R4: Formatted output

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 11 Mar 2021 00:26:35 -0500

std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");

The following are questions/concerns that came up during SG16 review of
P2093 <https://wg21.link/p2093> that are worthy of further discussion in
SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
determined either not to be SG16 concerns or to be issues for which we
did not want to hold back forward progress. These sentiments
were not unanimous.

The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
during our February 10th telecon. The poll was:

    Poll: Forward P2093R3 to LEWG.
    - Attendance: 9

    SF  F  N  A  SA
     4  2  2  0   1

Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
available at:

  * December 9th, 2020 telecon
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
    review of P2093R2 <https://wg21.link/p2093r2>.
  * February 10th, 2021 telecon
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
    review of P2093R3 <https://wg21.link/p2093r3>.

Questions raised include:

1. How should errors in transcoding be handled?
    The Unicode recommendation is to substitute a replacement character
    for invalid code unit sequences. P2093R4 <https://wg21.link/p2093r4>
    added wording to this effect.
2. Should this feature move forward without a parallel proposal to
    provide the underlying implementation-dependent features needed to
    implement std::print()?
    Specifically, should this feature be blocked on exposing interfaces
    to 1) determine if a stream is connected directly to a
    terminal/console, and 2) write directly to a terminal/console
    (potentially bypassing a stream) using native interfaces where
    applicable? These features would be necessary in order to implement
    a portable version of std::print(). (I believe Victor is already
    working on a companion paper.)
3. The choice to base behavior on the compile-time choice of execution
    character set results in locale settings being ignored at run-time.
    Is that ok?
     1. This choice will lead to unexpected results if a program runs in
        a non-UTF-8 locale and consumes non-Unicode input (e.g., from
        stdin) and then attempts to echo it back.
     2. Additionally, it means that a program that uses only ASCII
        characters in string literals will nevertheless behave
        differently at run-time depending on the choice of execution
        character set (which historically has only affected the encoding
        of string literals).
4. When the execution character set is not UTF-8, should conversion to
    Unicode be performed when writing directly to a Unicode enabled
    terminal/console?
     1. If so, should conversions be based on the compile-time literal
        encoding or the locale dependent run-time execution encoding?
     2. If the latter, that creates an odd asymmetry with the behavior
        when the execution character set is UTF-8. Is that ok?
5. What are the implications for future support of std::print("{} {} {}
    {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?
     1. As proposed, std::print() only produces unambiguously encoded
        output when the execution character set is UTF-8, and it is clear
        how such arguments should be handled in that case.
     2. But how would the behavior be defined when the execution
        character set is not UTF-8? Would the arguments be converted to
        the execution character set? Or to the locale dependent encoding?
     3. Note that these concerns are relevant for std::format() as well.
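
To make question 1 concrete, here is a minimal sketch of substituting
U+FFFD for ill-formed UTF-8 before the text is transcoded. It is an
illustration only, not the wording or implementation from P2093R4. The
function name replace_invalid_utf8 is hypothetical, and the choice of one
replacement character per maximal ill-formed subsequence is an assumption
based on the Unicode recommended practice.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of P2093): returns a copy of `in` in which
// each ill-formed UTF-8 subsequence is replaced by U+FFFD (bytes EF BF BD).
std::string replace_invalid_utf8(std::string_view in) {
    static constexpr char replacement[] = "\xEF\xBF\xBD";
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {  // ASCII passes through unchanged
            out.push_back(in[i]);
            ++i;
            continue;
        }
        // Sequence length and the permitted range of the second byte,
        // following the well-formed UTF-8 byte sequence table.
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // no overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // no surrogates
        else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // no overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // at most U+10FFFF
        if (len == 0) {  // C0, C1, F5..FF, or a stray continuation byte
            out += replacement;
            ++i;
            continue;
        }
        // Consume continuation bytes for as long as they are valid.
        std::size_t j = 1;
        for (; j < len && i + j < in.size(); ++j) {
            const unsigned char b = static_cast<unsigned char>(in[i + j]);
            const unsigned char lower = (j == 1) ? lo : 0x80;
            const unsigned char upper = (j == 1) ? hi : 0xBF;
            if (b < lower || b > upper) break;
        }
        if (j == len) out.append(in.substr(i, len));  // well-formed
        else out += replacement;  // truncated or invalid sequence
        i += j;
    }
    return out;
}
```

Well-formed input is returned unchanged, so applying a step like this
unconditionally would only alter text that could not be transcoded anyway.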

An additional issue that was not discussed in SG16 relates to Unicode
normalization. As proposed, the output will match expectations
if the UTF-8 text does not contain any uses of combining characters.
However, if combining characters are present, either because the text is
in NFD or because there is no precomposed character defined, then the
combining characters may be rendered separately from their base
character as a result of terminal/console interfaces mapping code points
rather than grapheme clusters to columns. Should std::print() also
perform NFC normalization so that characters with precomposed forms are
displayed correctly? (These concerns were explored in P1868
<https://wg21.link/p1868> when it was adopted for C++20; see that paper
for example screenshots; in practice, this is only an issue with the
Windows console).
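
As a small illustration of the normalization concern, the sketch below
encodes "é" in both NFC and NFD and counts the code points in each. The
helper count_code_points is hypothetical and assumes well-formed UTF-8
input. A console that assigns one column per code point would see one
unit for the NFC form and two for the NFD form, even though both denote
the same user-perceived character.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// the bytes that are not continuation bytes (bit pattern 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE
constexpr std::string_view e_acute_nfc = "\xC3\xA9";
// NFD: U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT
constexpr std::string_view e_acute_nfd = "e\xCC\x81";

static_assert(count_code_points(e_acute_nfc) == 1);
static_assert(count_code_points(e_acute_nfd) == 2);
```

The two strings also differ byte for byte, so passing them through
unmodified leaves any difference in rendering up to the terminal/console.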

It would not be unreasonable for LEWG to send some of these questions
back to SG16 for more analysis.

Tom.

Received on 2021-03-10 23:26:38