sg16: Re: [SG16] Questions for LEWG for P2093R4: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sun, 14 Mar 2021 08:12:56 -0700

> If a native Unicode output interface becomes attached to the stream

What interface are you referring to? To the best of my knowledge there is
no such interface on POSIX, so P2093 will not do transcoding and no errors
will be reported by a native interface in this case.
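
For illustration only, here is a minimal sketch of what this means on
POSIX. The helper names is_terminal and write_bytes are hypothetical and do
not come from P2093. The point is that the bytes are written unchanged
whether or not the stream is a terminal, so nothing is transcoded and
nothing reports invalid sequences.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdio.h>      // fileno (POSIX)
#include <string_view>

#ifdef _WIN32
#  include <io.h>       // _isatty, _fileno
#else
#  include <unistd.h>   // isatty
#endif

// Hypothetical helper: reports whether `stream` refers to a terminal/console.
bool is_terminal(std::FILE* stream) {
#ifdef _WIN32
    return _isatty(_fileno(stream)) != 0;
#else
    return isatty(fileno(stream)) != 0;
#endif
}

// Hypothetical helper: writes the bytes as they are, with no transcoding and
// no validation. Returns the number of bytes written.
std::size_t write_bytes(std::FILE* stream, std::string_view text) {
    return std::fwrite(text.data(), 1, text.size(), stream);
}
```

On POSIX a caller has no reason to branch on is_terminal before calling
write_bytes; the check only matters on platforms that offer a separate
console API.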

> The lack of UTF-8 encoding validation for output to
> non-console/non-Unicode capable streams even when the same stream, should
> it refer to a Unicode-capable output device, may have the UTF-8 encoding
> validation done is a bad design choice in my book.

In general I would agree but here we are trying to explicitly avoid
validation except for the only case where it is neither avoidable nor
programmatically detectable, at least when using replacement characters.
The only effect is that the user will see invalid sequences replaced by
something else on the console. There is just a small improvement in user
experience compared to existing facilities because instead of mojibake they
would get replacement characters.
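
As a rough sketch of the replacement behavior described above (an
illustration, not the wording or implementation from P2093), the function
below copies well-formed UTF-8 sequences and substitutes U+FFFD for every
byte that cannot begin a well-formed sequence. The names
utf8_sequence_length and replace_invalid_utf8 are hypothetical. Emitting
one replacement character per offending byte is a simplifying assumption; a
real implementation might group bytes differently.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the length (1 to 4) of the well-formed UTF-8 sequence that starts
// at s[i], or 0 if no well-formed sequence starts there.
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return static_cast<unsigned char>(s[k]);
    };
    // True if s[k] exists and lies in [lo, hi]; defaults to any trail byte.
    auto in_range = [&](std::size_t k, unsigned lo = 0x80, unsigned hi = 0xBF) {
        return k < s.size() && byte(k) >= lo && byte(k) <= hi;
    };
    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF) return in_range(i + 1) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        const unsigned lo = b0 == 0xE0 ? 0xA0 : 0x80;  // rejects overlong forms
        const unsigned hi = b0 == 0xED ? 0x9F : 0xBF;  // rejects surrogates
        return in_range(i + 1, lo, hi) && in_range(i + 2) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        const unsigned lo = b0 == 0xF0 ? 0x90 : 0x80;  // rejects overlong forms
        const unsigned hi = b0 == 0xF4 ? 0x8F : 0xBF;  // rejects > U+10FFFF
        return in_range(i + 1, lo, hi) && in_range(i + 2) && in_range(i + 3)
                   ? 4 : 0;
    }
    return 0;  // stray trail byte, or 0xC0, 0xC1, 0xF5..0xFF
}

// Copies valid sequences; replaces each offending byte with U+FFFD.
std::string replace_invalid_utf8(std::string_view s) {
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) {
            out += "\xEF\xBF\xBD";  // UTF-8 encoding of U+FFFD
            ++i;
        } else {
            out.append(s.data() + i, n);
            i += n;
        }
    }
    return out;
}
```

For example, the Latin-1 bytes "caf\xE9" would come out as "caf" followed
by U+FFFD, rather than being shown as mojibake by the console.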

So are you suggesting that we should do validation whenever the literal
encoding is known to be UTF-8? This will incur unavoidable and often
unnecessary (if you already have valid data) overhead, and I don't think we
do validation in other places in the standard library where we assume a
specific encoding. Anyway, this question should probably be answered by
LEWG or SG16.
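
For readers unfamiliar with how "literal encoding is known to be UTF-8" can
be determined, one way such a compile-time check can be written is sketched
below. The function name literal_encoding_is_utf8 is hypothetical, and the
sketch assumes the literal encoding can represent U+00B5 at all.

```cpp
// Sketch: reports whether ordinary string literals are encoded as UTF-8 by
// inspecting how the compiler encoded a single non-ASCII character.
constexpr bool literal_encoding_is_utf8() {
    // U+00B5 MICRO SIGN is the two bytes C2 B5 in UTF-8, so the array holds
    // those two bytes plus the terminating null.
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}
```

Because the result is a constant expression, the check itself has no
run-time cost. Any cost under discussion comes from validating the data.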

- Victor

On Sat, Mar 13, 2021 at 10:43 AM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Sat, Mar 13, 2021 at 11:36 AM Victor Zverovich <
> victor.zverovich_at_[hidden]> wrote:
>
>> Reply to Tom:
>>
>> > Should this feature move forward without a parallel proposal to provide
>> the underlying implementation dependent features needed to implement
>> std::print()? ... (I believe Victor is already working on a companion
>> paper).
>>
>> Just want to add that this was the main reason for the only SA vote in
>> SG16 and I'm indeed working on a separate paper to address this. The latter
>> is unnecessary for P2093 but could be useful if users decide to implement
>> their own formatted I/O library.
>>
>> Reply to Hubert:
>>
>> > Another question is whether the error handling for invalid code unit
>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>
>> I would recommend leaving it to the native API because we won't do
>> transcoding in this case and adding extra processing overhead just for
>> replacement characters seems undesirable. This is mostly a theoretical
>> question though because I am not aware of any such API.
>>
>> > Strings encoded for the locale will then come from things like user
>> input, message catalogs/resource files, the system library, etc. (for
>> example, strerror).
>>
>> I don't think it works in practice with console I/O on Windows, as Tom's
>> experiments and mine have demonstrated, because you have multiple
>> encodings in play. The assumption that there is one encoding that can be
>> determined via the global locale is often incorrect.
>>
>
> Sure, the locale-to-console/terminal encoding mismatch is still in play
> (but can be said to be an error on the part of the user of the console
> application). Yes, maybe APIs are present to change/bypass the
> console/terminal encoding; however, application developers are allowed to
> document constraints on the supported operating environment.
>
>
>> That said, P2093 still fully supports legacy encodings in the same way
>> printf does (by not doing any transcoding in this case).
>>
>
> P2093 uses a condition (that happens to be true by default when compiling
> with Clang for *nix) to determine whether to take strings as being UTF-8
> for std::print. If a native Unicode output interface becomes attached to
> the stream (which, if no extra explicit testing is done, is something that
> might happen only years after an application was written/built), P2093
> might not be transcoding itself, but it will start treating things as UTF-8
> (possibly leaving the native interface to handle problems).
>
>
>>
>> To clarify: P2093 only attempts to conservatively fix known broken cases
>> and does not assume any specific encoding otherwise. Therefore
>>
>> > using only "invariant" characters in string literals is a reasonable
>> way to write programs that operate under multiple locales.
>>
>> continues to be "supported" in the same way it is "supported" by current
>> facilities.
>>
>
> I don't think it is quite that conservative (as noted above, it tries to
> fix cases where it may be controversial whether things are "broken"). At
> the same time, I think it is "too conservative" in a sense. The lack of
> UTF-8 encoding validation for output to non-console/non-Unicode capable
> streams even when the same stream, should it refer to a Unicode-capable
> output device, may have the UTF-8 encoding validation done is a bad design
> choice in my book. Especially considering that the Unicode-capability, etc.
> detection is currently part of a black box in P2093, I think it is fair to
> say that, in the case described above, we're actually expecting the strings
> to be UTF-8 (and not really tailored to the specifics of what the stream is
> attached to). The "feature" of being able to output non-UTF-8 to an
> interface that should rightly be used only with UTF-8 without generating
> noticeably bad output (i.e., making things "accidentally work") potentially
> hides errors. I'm afraid that less-than-informed adoption will occur
> because noticing such errors requires specific testing configurations. I
> don't know yet if std::print usage normally imposes a large testing matrix,
> but it would be useful to know if there are reasons why it wouldn't.
>
>
>>
>> Cheers,
>> Victor
>>
>>
>> On Thu, Mar 11, 2021 at 9:33 PM Hubert Tong via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>>>
>>>> The following are questions/concerns that came up during SG16 review of
>>>> P2093 <https://wg21.link/p2093> that are worthy of further discussion
>>>> in SG16 and/or LEWG. Most of these issues were discussed in SG16 and were
>>>> either determined not to be SG16 concerns or deemed issues for which we
>>>> did not want to hold back forward progress. These sentiments were
>>>> not unanimous.
>>>>
>>>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was taken
>>>> during our February 10th telecon. The poll was:
>>>>
>>>> Poll: Forward P2093R3 to LEWG.
>>>> - Attendance: 9
>>>>
>>>> SF  F   N   A   SA
>>>> 4   2   2   0   1
>>>>
>>>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093> are
>>>> available at:
>>>>
>>>> - December 9th, 2020 telecon
>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>>>> review of P2093R2 <https://wg21.link/p2093r2>.
>>>> - February 10th, 2021 telecon
>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>>>> review of P2093R3 <https://wg21.link/p2093r3>.
>>>>
>>>> Questions raised include:
>>>>
>>>> 1. How should errors in transcoding be handled?
>>>> The Unicode recommendation is to substitute a replacement character
>>>> for invalid code unit sequences. P2093R4
>>>> <https://wg21.link/p2093r4> added wording to this effect.
>>>>
>>>> Another question is whether the error handling for invalid code unit
>>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>>
>>>>
>>>> 1. Should this feature move forward without a parallel proposal to
>>>> provide the underlying implementation-dependent features needed to
>>>> implement std::print()?
>>>> Specifically, should this feature be blocked on exposing interfaces
>>>> to 1) determine if a stream is connected directly to a terminal/console,
>>>> and 2) write directly to a terminal/console (potentially bypassing a
>>>> stream) using native interfaces where applicable? These features would be
>>>> necessary in order to implement a portable version of std::print().
>>>> (I believe Victor is already working on a companion paper).
>>>>
>>>> It is also interesting to ask if "line printers" or other text-oriented
>>> output devices should be considered for "direct Unicode output capability"
>>> behaviours.
>>>
>>>>
>>>> 1. The choice to base behavior on the compile-time choice of
>>>>    execution character set results in locale settings being ignored at
>>>>    run-time. Is that ok?
>>>>    1. This choice will lead to unexpected results if a program runs
>>>>       in a non-UTF-8 locale and consumes non-Unicode input (e.g., from
>>>>       stdin) and then attempts to echo it back.
>>>>    2. Additionally, it means that a program that uses only ASCII
>>>>       characters in string literals will nevertheless behave differently
>>>>       at run-time depending on the choice of execution character set
>>>>       (which historically has only affected the encoding of string
>>>>       literals).
>>>>
>>>> My understanding is that the paper is making an assumption that the
>>> choice (via the build mode) of using UTF-8 for the execution character set
>>> presumed for literals justifies assuming that plain-char strings "in
>>> the vicinity" of the output mechanism are UTF-8 encoded. The paper does not
>>> seem to say much about how much a user needs to do (or not) to end
>>> up with UTF-8 as the execution character set presumed for literals (plus
>>> how new/unique/indicative of intent doing so is within a platform
>>> ecosystem). I think it tells us that there's a level of opt-in for MSVC
>>> users and it is relatively new for the same (at which point, I think having
>>> the user be responsible for using UTF-8 locales is rather reasonable). For
>>> Clang, it seems the user just ends up with UTF-8 by default (without really
>>> asking for it).
>>>
>>> I believe the design is hard to justify without the assumption I
>>> indicated. I am not convinced that the paper presents information that
>>> justifies said assumption. Further to what Tom said, using only "invariant"
>>> characters in string literals is a reasonable way to write programs that
>>> operate under multiple locales. Strings encoded for the locale will then
>>> come from things like user input, message catalogs/resource files, the
>>> system library, etc. (for example, strerror). It seems that users with
>>> a need for non-UTF-8 locales who also want std::print for the
>>> convenience factor (and not the Unicode output) might run into problems. If
>>> the argument is that we'll all have -fexec-charset by the time this
>>> ships and a non-UTF-8 -fexec-charset should work fine for the users in
>>> question, then let that argument be made in the paper.
>>>
>>>
>>>> 1. When the execution character set is not UTF-8, should conversion
>>>>    to Unicode be performed when writing directly to a Unicode enabled
>>>>    terminal/console?
>>>>    1. If so, should conversions be based on the compile-time literal
>>>>       encoding or the locale dependent run-time execution encoding?
>>>>    2. If the latter, that creates an odd asymmetry with the
>>>>       behavior when the execution character set is UTF-8. Is that ok?
>>>> 2. What are the implications for future support of
>>>>    std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text",
>>>>    u"UTF-16 text", U"UTF-32 text")?
>>>>    1. As proposed, std::print() only produces unambiguously encoded
>>>>       output when the execution character set is UTF-8 and it is clear
>>>>       how these cases should be handled in that case.
>>>>    2. But how would the behavior be defined when the execution
>>>>       character set is not UTF-8? Would the arguments be converted to the
>>>>       execution character set? Or to the locale dependent encoding?
>>>>    3. Note that these concerns are relevant for std::format() as
>>>>       well.
>>>>
>>>> An additional issue that was not discussed in SG16 relates to Unicode
>>>> normalization. As proposed, the expected output will match expectations if
>>>> the UTF-8 text does not contain any uses of combining characters. However,
>>>> if combining characters are present, either because the text is in NFD or
>>>> because there is no precomposed character defined, then the combining
>>>> characters may be rendered separately from their base character as a result
>>>> of terminal/console interfaces mapping code points rather than grapheme
>>>> clusters to columns. Should std::print() also perform NFC
>>>> normalization so that characters with precomposed forms are displayed
>>>> correctly? (These concerns were explored in P1868
>>>> <https://wg21.link/p1868> when it was adopted for C++20; see that
>>>> paper for example screenshots; in practice, this is only an issue with the
>>>> Windows console).
>>>>
>>>> It would not be unreasonable for LEWG to send some of these questions
>>>> back to SG16 for more analysis.
>>>>
>>> A question for LEWG: Does the design impose versioning of prebuilt
>>> libraries between a UTF-8 build-mode and a non-UTF-8 build mode world?
>>>
>>>> Tom.
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>
>>

Received on 2021-03-14 10:13:11