std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
The following are questions/concerns that came up during SG16
review of P2093 that are worthy of
further discussion in SG16 and/or LEWG. Most of these issues were
discussed in SG16 and were either determined not to be SG16
concerns or deemed issues for which we did not want to
hold back forward progress. These sentiments were not unanimous.
The SG16 poll to forward P2093R3 was taken during
our February 10th telecon. The poll was:
Poll: Forward P2093R3 to LEWG.
- Attendance: 9
Minutes for prior SG16 reviews of P2093 are available at:
Questions raised include:
- How should errors in transcoding be handled?
The Unicode recommendation is to substitute a replacement
character for invalid code unit sequences. P2093R4
added wording to this effect.
- Should this feature move forward without a parallel proposal
to provide the underlying implementation-dependent features needed
to implement std::print()?
Specifically, should this feature be blocked on exposing
interfaces to 1) determine if a stream is connected directly to
a terminal/console, and 2) write directly to a terminal/console
(potentially bypassing a stream) using native interfaces where
applicable? These features would be necessary in order to
implement a portable version of std::print(). (I
believe Victor is already working on a companion paper).
- The choice to base behavior on the compile-time choice of
execution character set results in locale settings being ignored
at run-time. Is that ok?
- This choice will lead to unexpected results if a program
runs in a non-UTF-8 locale and consumes non-Unicode input
(e.g., from stdin) and then attempts to echo it back.
- Additionally, it means that a program that uses only ASCII
characters in string literals will nevertheless behave
differently at run-time depending on the choice of execution
character set (which historically has only affected the
encoding of string literals).
- When the execution character set is not UTF-8, should
conversion to Unicode be performed when writing directly to a
Unicode-enabled terminal/console?
- If so, should conversions be based on the compile-time
literal encoding or the locale-dependent run-time execution
encoding?
- If the latter, that creates an odd asymmetry with the
behavior when the execution character set is UTF-8. Is that
ok?
- What are the implications for future support of std::print("{}
{} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text",
U"UTF-32 text")?
- As proposed, std::print() only produces
unambiguously encoded output when the execution character set
is UTF-8, and in that case it is clear how these arguments
should be handled.
- But how would the behavior be defined when the execution
character set is not UTF-8? Would the arguments be converted
to the execution character set? Or to the locale dependent
encoding?
- Note that these concerns are relevant for std::format()
as well.
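
To make the first three questions above concrete, here is a minimal C++17 sketch of the inputs they turn on: a compile-time probe of the literal encoding, a check for whether a stream is attached to a terminal, and substitution of U+FFFD for invalid UTF-8 sequences. All function names (literal_encoding_is_utf8, is_terminal, replace_invalid_utf8, bytes_to_write) are hypothetical and are not taken from P2093. The terminal check assumes POSIX isatty() or the Microsoft CRT _isatty(), and the native console write itself is not shown.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#if defined(_WIN32)
#  include <io.h>
#else
#  include <unistd.h>
#endif

// Compile-time probe: true when ordinary string literals are UTF-8.
// U+00B5 is encoded as the two bytes C2 B5 in UTF-8.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Run-time check: is the stream connected directly to a terminal/console?
bool is_terminal(std::FILE* stream) {
#if defined(_WIN32)
    return _isatty(_fileno(stream)) != 0;
#else
    return isatty(fileno(stream)) != 0;
#endif
}

// Copies well-formed UTF-8 through unchanged and emits one U+FFFD for each
// maximal ill-formed subsequence.
std::string replace_invalid_utf8(std::string_view in) {
    static constexpr char replacement[] = "\xEF\xBF\xBD";  // U+FFFD
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;                 // expected sequence length
        unsigned char lo = 0x80, hi = 0xBF;  // valid range for the 2nd byte
        if (b0 <= 0x7F) len = 1;
        else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // no overlong forms
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // no surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) len = 3;
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // no overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // max U+10FFFF

        if (len == 0) {  // byte can never start a sequence
            out += replacement;
            ++i;
            continue;
        }
        std::size_t j = 1;
        for (; j < len && i + j < in.size(); ++j) {
            const auto b = static_cast<unsigned char>(in[i + j]);
            if (b < lo || b > hi) break;
            lo = 0x80;  // later trailing bytes use the full range
            hi = 0xBF;
        }
        if (j == len) out.append(in.data() + i, len);
        else out += replacement;
        i += j;  // skip the bytes that formed the (partial) sequence
    }
    return out;
}

// Take the Unicode path only when literals are UTF-8 and the stream is a
// terminal; otherwise the bytes pass through unchanged.
std::string bytes_to_write(std::FILE* stream, std::string_view text) {
    if (literal_encoding_is_utf8() && is_terminal(stream))
        return replace_invalid_utf8(text);
    return std::string(text);
}
```

Note that nothing in this sketch consults the run-time locale. That is the behavior the third question asks about: the decision depends only on how the program was compiled and on where the stream points.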
An additional issue that was not discussed in SG16 relates to
Unicode normalization. As proposed, the output will
match expectations if the UTF-8 text does not contain any uses of
combining characters. However, if combining characters are
present, either because the text is in NFD or because there is no
precomposed character defined, then the combining characters may
be rendered separately from their base character as a result of
terminal/console interfaces mapping code points rather than
grapheme clusters to columns. Should std::print() also
perform NFC normalization so that characters with precomposed
forms are displayed correctly? (These concerns were explored in P1868
when it was adopted for C++20; see that paper for example
screenshots; in practice, this is only an issue with the Windows
console).
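
The difference between the two forms can be seen by counting code points. The following is a minimal C++17 sketch that assumes well-formed UTF-8 input. The helper count_code_points is hypothetical, and the sketch performs no normalization; actual NFC conversion would require Unicode data tables or a dedicated library.

```cpp
#include <cstddef>
#include <string_view>

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes have the form 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// "é" in NFC: the single precomposed code point U+00E9.
constexpr std::string_view e_acute_nfc = "\xC3\xA9";
// "é" in NFD: U+0065 (e) followed by U+0301 (combining acute accent).
constexpr std::string_view e_acute_nfd = "e\xCC\x81";

static_assert(count_code_points(e_acute_nfc) == 1);
static_assert(count_code_points(e_acute_nfd) == 2);
```

Both strings represent a single grapheme cluster, but a console that maps code points to columns sees two units in the NFD form. That is why the combining mark can be rendered separately from its base character.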
It would not be unreasonable for LEWG to send some of these
questions back to SG16 for more analysis.
Tom.