sg16: Re: [SG16] Questions for LEWG for P2093R4: Formatted output

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Fri, 12 Mar 2021 00:32:55 -0500

On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>
> The following are questions/concerns that came up during SG16 review of
> P2093 <https://wg21.link/p2093> that are worthy of further discussion in
> SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
> determined either not to be SG16 concerns or were deemed issues for
> which we did not want to hold back forward progress. These sentiments were
> not unanimous.
>
> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
> during our February 10th telecon. The poll was:
>
> Poll: Forward P2093R3 to LEWG.
> - Attendance: 9
> SF  F  N  A  SA
>  4  2  2  0   1
>
> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
> available at:
>
> - December 9th, 2020 telecon
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
> review of P2093R2 <https://wg21.link/p2093r2>.
> - February 10th, 2021 telecon
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
> review of P2093R3 <https://wg21.link/p2093r3>.
>
> Questions raised include:
>
> 1. How should errors in transcoding be handled?
> The Unicode recommendation is to substitute a replacement character
> for invalid code unit sequences. P2093R4 <https://wg21.link/p2093r4>
> added wording to this effect.
>
Another question is whether the error handling for invalid code unit
sequences should be left to the native Unicode API if it accepts UTF-8.
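
For illustration only (this is not taken from the paper or from any implementation), a minimal C++17 sketch of replacement-character substitution might look like the following. The function name `replace_invalid_utf8` is hypothetical. The sketch substitutes one U+FFFD per offending byte, which is simpler than the "maximal subpart" practice the Unicode Standard recommends and can therefore emit more replacement characters for a truncated sequence.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of P2093): copies `in`, replacing every byte
// that is not part of a well-formed UTF-8 sequence with U+FFFD.
// Assumption: one U+FFFD per offending byte, not per maximal subpart.
std::string replace_invalid_utf8(std::string_view in) {
    static constexpr std::string_view replacement = "\xEF\xBF\xBD";  // U+FFFD
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;            // expected sequence length, 0 = bad lead byte
        unsigned lo = 0x80, hi = 0xBF;  // permitted range for the second byte
        if (b0 < 0x80) {
            len = 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong encodings
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong encodings
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        }
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            const unsigned min = (k == 1) ? lo : 0x80u;
            const unsigned max = (k == 1) ? hi : 0xBFu;
            ok = b >= min && b <= max;
        }
        if (ok) {
            out.append(in.substr(i, len));  // well-formed: copy unchanged
            i += len;
        } else {
            out.append(replacement);        // ill-formed: substitute and resync
            ++i;
        }
    }
    return out;
}
```

Leaving the handling to a native API that accepts UTF-8 would make a pre-pass like this unnecessary, at the cost of platform-dependent results for ill-formed input.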

>
> 2. Should this feature move forward without a parallel proposal to
> provide the underlying implementation-dependent features needed to implement
> std::print()?
> Specifically, should this feature be blocked on exposing interfaces to
> 1) determine if a stream is connected directly to a terminal/console, and
> 2) write directly to a terminal/console (potentially bypassing a stream)
> using native interfaces where applicable? These features would be
> necessary in order to implement a portable version of std::print().
> (I believe Victor is already working on a companion paper).
>
It is also interesting to ask if "line printers" or other text-oriented
output devices should be considered for "direct Unicode output capability"
behaviours.
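
As a rough sketch of the first of the two facilities mentioned above (determining whether a stream is connected to a terminal), the check can be written on POSIX systems with `isatty` and `fileno`. The wrapper name `is_terminal` is hypothetical and is not an interface proposed anywhere in this thread. On Windows the analogous check would presumably go through `_isatty` or `GetConsoleMode`, which this sketch does not cover.

```cpp
#include <stdio.h>   // FILE, fileno (POSIX)
#include <unistd.h>  // isatty (POSIX)

// Hypothetical wrapper: true if `stream` refers to a terminal device.
// Assumption: POSIX only; no Windows console handling is attempted.
bool is_terminal(FILE* stream) {
    return stream != nullptr && isatty(fileno(stream)) != 0;
}
```

A caller could use `is_terminal(stdout)` to choose between a native console write and an ordinary stream write. Whether a line printer or similar device should answer "yes" is exactly the open question.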

>
> 3. The choice to base behavior on the compile-time choice of execution
> character set results in locale settings being ignored at run-time. Is
> that ok?
> 1. This choice will lead to unexpected results if a program runs in
> a non-UTF-8 locale and consumes non-Unicode input (e.g., from stdin) and
> then attempts to echo it back.
> 2. Additionally, it means that a program that uses only ASCII
> characters in string literals will nevertheless behave differently at
> run-time depending on the choice of execution character set (which
> historically has only affected the encoding of string literals).
>
My understanding is that the paper is making an assumption that the choice
(via the build mode) of using UTF-8 for the execution character set
presumed for literals justifies assuming that plain-char strings "in the
vicinity" of the output mechanism are UTF-8 encoded. The paper does not
seem to have much coverage of how much a user needs to do (or not) to end
up with UTF-8 as the execution character set presumed for literals (plus
how new/unique/indicative of intent doing so is within a platform
ecosystem). I think it tells us that there's a level of opt-in for MSVC
users and it is relatively new for the same (at which point, I think having
the user be responsible for using UTF-8 locales is rather reasonable). For
Clang, it seems the user just ends up with UTF-8 by default (without really
asking for it).

I believe the design is hard to justify without the assumption I indicated.
I am not convinced that the paper presents information that justifies said
assumption. Further to what Tom said, using only "invariant" characters in
string literals is a reasonable way to write programs that operate under
multiple locales. Strings encoded for the locale will then come from things
like user input, message catalogs/resource files, the system library, etc.
(for example, strerror). It seems that users with a need for non-UTF-8
locales who also want std::print for the convenience factor (and not the
Unicode output) might run into problems. If the argument is that we'll all
have -fexec-charset by the time this ships and a non-UTF-8 -fexec-charset
should work fine for the users in question, then let that argument be made
in the paper.
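
To make the compile-time versus run-time distinction concrete, here is a hedged C++17 sketch of one commonly used way to detect, at compile time, whether ordinary string literals are encoded as UTF-8. Both function names are hypothetical. The run-time codeset name passed to `encodings_disagree` is assumed to come from the platform (for example `nl_langinfo(CODESET)` on POSIX), and the spellings it is compared against are an assumption, since codeset names vary between systems.

```cpp
#include <string_view>

// Hypothetical helper: true if ordinary string literals in this translation
// unit are UTF-8 encoded. U+00B5 MICRO SIGN is 0xC2 0xB5 in UTF-8.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Hypothetical helper: true if the build-time assumption and the run-time
// locale codeset disagree about whether char strings are UTF-8.
// Assumption: only these two spellings of the codeset name are recognized.
constexpr bool encodings_disagree(std::string_view runtime_codeset) {
    const bool runtime_is_utf8 =
        runtime_codeset == "UTF-8" || runtime_codeset == "utf8";
    return literal_encoding_is_utf8() != runtime_is_utf8;
}
```

A program built with UTF-8 literals but run under, say, an ISO-8859-1 locale is the case where locale-encoded strings from `strerror` or a message catalog would be misinterpreted.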

> 4. When the execution character set is not UTF-8, should conversion to
> Unicode be performed when writing directly to a Unicode enabled
> terminal/console?
> 1. If so, should conversions be based on the compile-time literal
> encoding or the locale dependent run-time execution encoding?
> 2. If the latter, that creates an odd asymmetry with the behavior
> when the execution character set is UTF-8. Is that ok?
> 5. What are the implications for future support of
> std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?
> 1. As proposed, std::print() only produces unambiguously encoded
> output when the execution character set is UTF-8 and it is clear how these
> cases should be handled in that case.
> 2. But how would the behavior be defined when the execution
> character set is not UTF-8? Would the arguments be converted to the
> execution character set? Or to the locale dependent encoding?
> 3. Note that these concerns are relevant for std::format() as well.
>
> An additional issue that was not discussed in SG16 relates to Unicode
> normalization. As proposed, the expected output will match expectations if
> the UTF-8 text does not contain any uses of combining characters. However,
> if combining characters are present, either because the text is in NFD or
> because there is no precomposed character defined, then the combining
> characters may be rendered separately from their base character as a result
> of terminal/console interfaces mapping code points rather than grapheme
> clusters to columns. Should std::print() also perform NFC normalization
> so that characters with precomposed forms are displayed correctly? (These
> concerns were explored in P1868 <https://wg21.link/p1868> when it was
> adopted for C++20; see that paper for example screenshots; in practice,
> this is only an issue with the Windows console).
>
> It would not be unreasonable for LEWG to send some of these questions back
> to SG16 for more analysis.
>
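
As a small illustration of the normalization concern (a sketch, not something from the paper), the precomposed and decomposed spellings of "é" differ in code point count, which is what a console that maps code points to columns would see. The function name `count_code_points` is hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (const char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE, one code point.
constexpr std::string_view e_acute_nfc = "\xC3\xA9";
// NFD: U+0065 followed by U+0301 COMBINING ACUTE ACCENT, two code points.
constexpr std::string_view e_acute_nfd = "e\xCC\x81";
```

Here `count_code_points(e_acute_nfc)` is 1 and `count_code_points(e_acute_nfd)` is 2, although both are a single grapheme cluster. NFC normalization would turn the second form into the first, but it cannot help where no precomposed character exists.
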
A question for LEWG: Does the design impose versioning of prebuilt
libraries between a UTF-8 build-mode world and a non-UTF-8 build-mode world?

> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-03-11 23:33:26