sg16: Re: [SG16] Questions for LEWG for P2093R4: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 13 Mar 2021 08:36:10 -0800

Reply to Tom:

> Should this feature move forward without a parallel proposal to provide
> the underlying implementation-dependent features needed to implement
> std::print()? ... (I believe Victor is already working on a companion
> paper).

Just want to add that this was the main reason for the only SA vote in SG16,
and I am indeed working on a separate paper to address it. That paper is
not required for P2093 itself but could be useful if users decide to implement
their own formatted I/O library.
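
For readers wondering what the two facilities in Tom's question might look
like in practice, here is a rough sketch. It is not taken from P2093 or from
the companion paper. The names is_terminal and write_text are hypothetical,
the POSIX branch relies on isatty, and the Windows branch assumes that a
successful GetConsoleMode call is an adequate console test and that the input
is UTF-8.

```cpp
#include <cassert>
#include <cstdio>
#include <string_view>

#ifdef _WIN32
#  include <io.h>       // _fileno, _get_osfhandle
#  include <string>
#  include <windows.h>  // GetConsoleMode, MultiByteToWideChar, WriteConsoleW
#else
#  include <stdio.h>    // fileno (POSIX)
#  include <unistd.h>   // isatty (POSIX)
#endif

// Hypothetical helper: does `f` refer to a terminal/console?
inline bool is_terminal(std::FILE* f) {
  assert(f != nullptr);
#ifdef _WIN32
  DWORD mode = 0;
  auto handle = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(f)));
  return GetConsoleMode(handle, &mode) != 0;
#else
  return isatty(fileno(f)) != 0;
#endif
}

// Hypothetical helper: write text to `f`. On a Windows console the text is
// assumed to be UTF-8 and goes through the native wide-character API;
// everywhere else the bytes are written unchanged, as printf would do.
inline bool write_text(std::FILE* f, std::string_view text) {
#ifdef _WIN32
  if (!text.empty() && is_terminal(f)) {
    std::fflush(f);  // keep ordering with earlier buffered output
    const int in_len = static_cast<int>(text.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, text.data(), in_len, nullptr, 0);
    if (n <= 0) return false;
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, text.data(), in_len, wide.data(), n);
    DWORD written = 0;
    auto handle = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(f)));
    return WriteConsoleW(handle, wide.data(), static_cast<DWORD>(n), &written, nullptr) != 0;
  }
#endif
  return std::fwrite(text.data(), 1, text.size(), f) == text.size();
}
```

When the stream is a file or a pipe, write_text degenerates to a plain
fwrite, which is why redirected output is unaffected by the console path.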

Reply to Hubert:

> Another question is whether the error handling for invalid code unit
> sequences should be left to the native Unicode API if it accepts UTF-8.

I would recommend leaving it to the native API because we won't do
transcoding in this case, and adding extra processing overhead just for
replacement characters seems undesirable. This is mostly a theoretical
question, though, because I am not aware of any such API.
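
To make the "extra processing" concrete: substituting replacement characters
means validating every byte before it is written. Below is a minimal sketch,
assuming the "maximal subpart" substitution practice described in the Unicode
Standard. The name sanitize_utf8 is hypothetical and nothing here is taken
from the P2093 wording.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: copy `in`, replacing each ill-formed UTF-8 subsequence
// with U+FFFD REPLACEMENT CHARACTER (encoded as EF BF BD).
inline std::string sanitize_utf8(std::string_view in) {
  std::string out;
  out.reserve(in.size());
  std::size_t i = 0;
  while (i < in.size()) {
    const auto lead = static_cast<unsigned char>(in[i]);
    std::size_t len = 0;                // 0 means "not a valid lead byte"
    unsigned char lo = 0x80, hi = 0xBF; // allowed range for the second byte
    if (lead < 0x80) {
      len = 1;
    } else if (lead >= 0xC2 && lead <= 0xDF) {
      len = 2;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      len = 3;
      if (lead == 0xE0) lo = 0xA0;      // reject overlong forms
      if (lead == 0xED) hi = 0x9F;      // reject surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      len = 4;
      if (lead == 0xF0) lo = 0x90;      // reject overlong forms
      if (lead == 0xF4) hi = 0x8F;      // reject values above U+10FFFF
    }
    std::size_t ok = 1;                 // bytes accepted so far, lead included
    while (len != 0 && ok < len && i + ok < in.size()) {
      const auto b = static_cast<unsigned char>(in[i + ok]);
      const unsigned char min = (ok == 1) ? lo : 0x80;
      const unsigned char max = (ok == 1) ? hi : 0xBF;
      if (b < min || b > max) break;
      ++ok;
    }
    if (len != 0 && ok == len) {
      out.append(in.data() + i, len);   // well-formed: copy unchanged
    } else {
      out += "\xEF\xBF\xBD";            // one U+FFFD per maximal subpart
    }
    i += ok;
  }
  assert(out.size() >= in.size());      // U+FFFD is never shorter than what it replaces
  return out;
}
```

The loop touches every byte even when the input is entirely valid, which is
the overhead in question when the native API would accept the bytes as-is.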

> Strings encoded for the locale will then come from things like user
> input, message catalogs/resource files, the system library, etc. (for
> example, strerror).

I don't think this works in practice with console I/O on Windows, as Tom's
and my experiments have demonstrated, because multiple encodings are in
play. The assumption that there is one encoding that can be determined via the
global locale is often incorrect. That said, P2093 still fully supports
legacy encodings in the same way printf does (by not doing any transcoding
in this case).

To clarify: P2093 only attempts to conservatively fix known broken cases
and does not assume any specific encoding otherwise. Therefore

> using only "invariant" characters in string literals is a reasonable way
> to write programs that operate under multiple locales.

continues to be "supported" in the same way it is "supported" by current
facilities.
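
The compile-time decision under discussion can be approximated by checking
how a known non-ASCII character is encoded in an ordinary string literal.
This is a sketch under assumptions: both function names are hypothetical, and
the rule "native Unicode output only for a terminal when literals are UTF-8"
is a paraphrase of this thread, not quoted wording. A compiler may also
diagnose the literal if the execution character set cannot represent U+00B5.

```cpp
// Hypothetical helper: true when ordinary string literals are UTF-8 encoded.
// U+00B5 MICRO SIGN is the two bytes C2 B5 in UTF-8, so the array holds
// three elements (two bytes plus the terminating NUL) only in that case.
constexpr bool literal_encoding_is_utf8() {
  constexpr unsigned char micro[] = "\u00B5";
  return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Hypothetical helper: take the native Unicode path only when both
// conditions hold; otherwise bytes are passed through untouched, which
// keeps legacy encodings working the way they do with printf.
constexpr bool use_native_unicode_output(bool stream_is_terminal) {
  return literal_encoding_is_utf8() && stream_is_terminal;
}
```

With GCC or Clang defaults the first function yields true; building with a
non-UTF-8 -fexec-charset (or MSVC without /utf-8) makes it false, and the
output path then never transcodes.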

Cheers,
Victor

On Thu, Mar 11, 2021 at 9:33 PM Hubert Tong via SG16 <sg16_at_[hidden]>
wrote:

> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>
>> The following are questions/concerns that came up during SG16 review of
>> P2093 <https://wg21.link/p2093> that are worthy of further discussion in
>> SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
>> determined either not to be SG16 concerns or were deemed issues for
>> which we did not want to hold back forward progress. These sentiments were
>> not unanimous.
>>
>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
>> during our February 10th telecon. The poll was:
>>
>> Poll: Forward P2093R3 to LEWG.
>> - Attendance: 9
>>
>>   SF | F | N | A | SA
>>    4 | 2 | 2 | 0 |  1
>>
>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
>> available at:
>>
>> - December 9th, 2020 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>> review of P2093R2 <https://wg21.link/p2093r2>.
>> - February 10th, 2021 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>> review of P2093R3 <https://wg21.link/p2093r3>.
>>
>> Questions raised include:
>>
>> 1. How should errors in transcoding be handled?
>> The Unicode recommendation is to substitute a replacement character
>> for invalid code unit sequences. P2093R4 <https://wg21.link/p2093r4>
>> added wording to this effect.
>>
> Another question is whether the error handling for invalid code unit
> sequences should be left to the native Unicode API if it accepts UTF-8.
>
>>
>> 2. Should this feature move forward without a parallel proposal to
>>    provide the underlying implementation-dependent features needed to
>>    implement std::print()?
>>    Specifically, should this feature be blocked on exposing interfaces
>>    to 1) determine if a stream is connected directly to a terminal/console,
>>    and 2) write directly to a terminal/console (potentially bypassing a
>>    stream) using native interfaces where applicable? These features would be
>>    necessary in order to implement a portable version of std::print().
>>    (I believe Victor is already working on a companion paper).
>>
> It is also interesting to ask if "line printers" or other text-oriented
> output devices should be considered for "direct Unicode output capability"
> behaviours.
>
>>
>> 3. The choice to base behavior on the compile-time choice of
>>    execution character set results in locale settings being ignored at
>>    run-time. Is that ok?
>>    1. This choice will lead to unexpected results if a program runs
>>       in a non-UTF-8 locale and consumes non-Unicode input (e.g., from
>>       stdin) and then attempts to echo it back.
>>    2. Additionally, it means that a program that uses only ASCII
>>       characters in string literals will nevertheless behave differently at
>>       run-time depending on the choice of execution character set (which
>>       historically has only affected the encoding of string literals).
>>
> My understanding is that the paper is making an assumption that the
> choice (via the build mode) of using UTF-8 for the execution character set
> presumed for literals justifies assuming that plain-char strings "in the
> vicinity" of the output mechanism are UTF-8 encoded. The paper does not
> seem to have much coverage over how much a user needs to do (or not) to end
> up with UTF-8 as the execution character set presumed for literals (plus
> how new/unique/indicative of intent doing so is within a platform
> ecosystem). I think it tells us that there's a level of opt-in for MSVC
> users and it is relatively new for the same (at which point, I think having
> the user be responsible for using UTF-8 locales is rather reasonable). For
> Clang, it seems the user just ends up with UTF-8 by default (without really
> asking for it).
>
> I believe the design is hard to justify without the assumption I
> indicated. I am not convinced that the paper presents information that
> justifies said assumption. Further to what Tom said, using only "invariant"
> characters in string literals is a reasonable way to write programs that
> operate under multiple locales. Strings encoded for the locale will then
> come from things like user input, message catalogs/resource files, the
> system library, etc. (for example, strerror). It seems that users with a
> need for non-UTF-8 locales who also want std::print for the convenience
> factor (and not the Unicode output) might run into problems. If the
> argument is that we'll all have -fexec-charset by the time this ships and
> a non-UTF-8 -fexec-charset should work fine for the users in question,
> then let that argument be made in the paper.
>
>
>> 4. When the execution character set is not UTF-8, should conversion
>>    to Unicode be performed when writing directly to a Unicode enabled
>>    terminal/console?
>>    1. If so, should conversions be based on the compile-time literal
>>       encoding or the locale dependent run-time execution encoding?
>>    2. If the latter, that creates an odd asymmetry with the behavior
>>       when the execution character set is UTF-8. Is that ok?
>> 5. What are the implications for future support of
>>    std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")?
>>    1. As proposed, std::print() only produces unambiguously encoded
>>       output when the execution character set is UTF-8 and it is clear
>>       how these cases should be handled in that case.
>>    2. But how would the behavior be defined when the execution
>>       character set is not UTF-8? Would the arguments be converted to the
>>       execution character set? Or to the locale dependent encoding?
>> 6. Note that these concerns are relevant for std::format() as well.
>>
>> An additional issue that was not discussed in SG16 relates to Unicode
>> normalization. As proposed, the output will match expectations if
>> the UTF-8 text does not contain any uses of combining characters. However,
>> if combining characters are present, either because the text is in NFD or
>> because there is no precomposed character defined, then the combining
>> characters may be rendered separately from their base character as a result
>> of terminal/console interfaces mapping code points rather than grapheme
>> clusters to columns. Should std::print() also perform NFC normalization
>> so that characters with precomposed forms are displayed correctly? (These
>> concerns were explored in P1868 <https://wg21.link/p1868> when it was
>> adopted for C++20; see that paper for example screenshots; in practice,
>> this is only an issue with the Windows console).
>>
>> It would not be unreasonable for LEWG to send some of these questions
>> back to SG16 for more analysis.
>>
> A question for LEWG: Does the design impose versioning of prebuilt
> libraries between a UTF-8 build mode and a non-UTF-8 build mode world?
>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2021-03-13 10:36:26