sg16: Re: [SG16] Questions for LEWG for P2093R4: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 20 Mar 2021 07:51:06 -0700

> A user with a console/terminal encoding that matches their locale
> encoding running a program that outputs a matching non-UTF-8 locale-encoded
> string, would, for a program compiled with UTF-8 for the encoding of string
> literals, "gain" replacement characters on a Unicode (but not UTF-8,
> because then we just don't know what happens) capable terminal. The same
> user (with the same build configuration and runtime environment) gets the
> locale-encoded string displayed correctly using the "existing facilities".

OK, let's test this hypothesis.

Here is a simple test program:

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main() {
    setlocale(LC_ALL, "Russian_Russia.866");
    const char* message = "Привет, мир!\n";
    if ((unsigned char)message[0] != 0x8F) {
      puts("wrong encoding");
      abort();
    }
    printf("%s", message);
  }

It sets the locale encoding to the console encoding (CP866 in this case)
and uses a "legacy" printf to print the message. Note that it took some
effort to write this program because Notepad doesn't even support CP866!
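
For readers unfamiliar with CP866: the 0x8F check works because П is the
single byte 0x8F in CP866 but the two bytes 0xD0 0x9F in UTF-8. A minimal
sketch of the mapping, assuming only ASCII and the Cyrillic letters need to
be covered; cp866_to_utf8 and append_utf8 are hypothetical helper names, not
part of P2093 or of any library:

```cpp
#include <string>
#include <string_view>

// Append a code point (assumed to be <= U+FFFF in this sketch) as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Transcode CP866 bytes to UTF-8. Only ASCII and the Cyrillic letters are
// mapped; every other byte (box drawing etc.) becomes U+FFFD in this sketch.
std::string cp866_to_utf8(std::string_view in) {
  std::string out;
  for (unsigned char b : in) {
    char32_t cp;
    if (b < 0x80) cp = b;                                       // ASCII
    else if (b <= 0xAF) cp = 0x0410 + (b - 0x80);               // А..Я, а..п
    else if (b >= 0xE0 && b <= 0xEF) cp = 0x0440 + (b - 0xE0);  // р..я
    else if (b == 0xF0) cp = 0x0401;                            // Ё
    else if (b == 0xF1) cp = 0x0451;                            // ё
    else cp = 0xFFFD;                                           // not covered
    append_utf8(out, cp);
  }
  return out;
}
```

With this, cp866_to_utf8("\x8F") yields the bytes 0xD0 0x9F, which is what
the literal would contain if the source file were really UTF-8.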

Let's compile it:

>cl /utf-8 test-ru.c
  ...
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x8d that is illegal in the current source character set (codepage 65001).
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x91 that is illegal in the current source character set (codepage 65001).
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x92 that is illegal in the current source character set (codepage 65001).
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x95 that is illegal in the current source character set (codepage 65001).
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x96 that is illegal in the current source character set (codepage 65001).
  test-ru.c(1): warning C4828: The file contains a character starting at offset 0x97 that is illegal in the current source character set (codepage 65001).
  ...

and run:

>test-ru
  Привет, ми!\n

It almost works, but instead of "Hello, world!" we get "Hello, E!" (where E
is the musical note). So I'd argue that it is worse than other forms of
mojibake because it works for some inputs but not for others. One might
argue that the string comes from a file and not a literal, but in that case
it will most likely be either UTF-8 or ACP encoded, which may be different
from the console encoding. Then we may or may not get mojibake depending on
whether the ACP and the console encoding match, which I think is a
terrible API.
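
The two encodings in question can be queried on Windows with GetACP and
GetConsoleOutputCP. A rough sketch, assuming a C++17 compiler; the
code_pages struct and both function names are made up for illustration, and
on other platforms the query simply reports nothing:

```cpp
#include <optional>
#ifdef _WIN32
#  include <windows.h>
#endif

struct code_pages {
  unsigned acp;             // active ("ANSI") code page
  unsigned console_output;  // code page the console uses for output
};

// Returns std::nullopt on non-Windows systems, or when the console code
// page cannot be obtained (this sketch treats a zero result as "no console").
std::optional<code_pages> query_code_pages() {
#ifdef _WIN32
  unsigned console = GetConsoleOutputCP();
  if (console == 0) return std::nullopt;
  return code_pages{GetACP(), console};
#else
  return std::nullopt;
#endif
}

// ACP-encoded bytes written as-is are at risk of mojibake whenever the two
// code pages differ.
bool may_produce_mojibake(const code_pages& cp) {
  return cp.acp != cp.console_output;
}
```

For instance, code_pages{1251, 866} (a typical Russian configuration) is
flagged as a mismatch, while code_pages{65001, 65001} is not.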

Moreover, the locale encoding is completely irrelevant here because printf
doesn't take it into account at all.
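
For comparison, the replacement-character behavior referred to in the quoted
text (substituting U+FFFD for invalid code unit sequences) can be sketched as
follows. This is an illustration only and not the P2093 wording: it emits one
U+FFFD per offending byte, which is simpler than the "maximal subpart"
practice described in the Unicode Standard, and replace_invalid_utf8 and
utf8_sequence_length are hypothetical names:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i] (i < s.size()),
// or 0 if no well-formed sequence starts there.
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
  auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
  auto cont = [&](std::size_t k) {
    return k < s.size() && (byte(k) & 0xC0) == 0x80;  // 10xxxxxx
  };
  unsigned char b0 = byte(i);
  if (b0 < 0x80) return 1;
  if (b0 >= 0xC2 && b0 <= 0xDF) return cont(i + 1) ? 2 : 0;
  if (b0 >= 0xE0 && b0 <= 0xEF) {
    if (!cont(i + 1) || !cont(i + 2)) return 0;
    unsigned char b1 = byte(i + 1);
    if (b0 == 0xE0 && b1 < 0xA0) return 0;  // overlong encoding
    if (b0 == 0xED && b1 > 0x9F) return 0;  // UTF-16 surrogate range
    return 3;
  }
  if (b0 >= 0xF0 && b0 <= 0xF4) {
    if (!cont(i + 1) || !cont(i + 2) || !cont(i + 3)) return 0;
    unsigned char b1 = byte(i + 1);
    if (b0 == 0xF0 && b1 < 0x90) return 0;  // overlong encoding
    if (b0 == 0xF4 && b1 > 0x8F) return 0;  // beyond U+10FFFF
    return 4;
  }
  return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Copy well-formed sequences unchanged; replace each offending byte with
// U+FFFD (encoded as EF BF BD) and resume at the next byte.
std::string replace_invalid_utf8(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size();) {
    std::size_t n = utf8_sequence_length(s, i);
    if (n == 0) {
      out += "\xEF\xBF\xBD";
      ++i;
    } else {
      out.append(s.data() + i, n);
      i += n;
    }
  }
  return out;
}
```

Fed the CP866 bytes 0x8F 0xE0 0xA8 followed by '!', this produces three
U+FFFD characters and then '!', while well-formed UTF-8 passes through
unchanged.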

Cheers,
Victor

On Sun, Mar 14, 2021 at 11:08 AM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Sun, Mar 14, 2021 at 11:13 AM Victor Zverovich <
> victor.zverovich_at_[hidden]> wrote:
>
>> > If a native Unicode output interface becomes attached to the stream
>>
>> What interface are you referring to? To the best of my knowledge there is
>> no such interface on POSIX, so P2093 will not do transcoding, nor will
>> errors be reported by the native interface in this case.
>>
>
> None that I am aware of in particular. However, extending terminfo, etc.
> so that such an interface becomes available in the future could not be
> discounted as a possibility.
>
>
>>
>> > The lack of UTF-8 encoding validation for output to
>> non-console/non-Unicode capable streams even when the same stream, should
>> it refer to a Unicode-capable output device, may have the UTF-8 encoding
>> validation done is a bad design choice in my book.
>>
>> In general I would agree but here we are trying to explicitly avoid
>> validation except for the only case where it is neither avoidable nor
>> programmatically detectable, at least when using replacement characters.
>> The only effect is that the user will see invalid sequences replaced by
>> something else on the console. There is just a small improvement in user
>> experience compared to existing facilities because instead of mojibake they
>> would get replacement characters.
>>
>
> A user with a console/terminal encoding that matches their locale encoding
> running a program that outputs a matching non-UTF-8 locale-encoded string,
> would, for a program compiled with UTF-8 for the encoding of string
> literals, "gain" replacement characters on a Unicode (but not UTF-8,
> because then we just don't know what happens) capable terminal. The same
> user (with the same build configuration and runtime environment) gets the
> locale-encoded string displayed correctly using the "existing facilities".
> They will also get the locale-encoded string displayed correctly if their
> terminal doesn't report as being Unicode capable (either because it isn't
> or because the various libraries in their operating environment are not
> capable of taking advantage of the terminal's Unicode capability for now).
>
> So, things can look like they are fine when people start adopting
> std::print (even if they really aren't fine).
>
> I think the following dimensions are relevant in evaluating the proposal:
>
> "Build" properties:
> Encoding used for string literals
>
> "Source" properties:
> Actual string encoding (locale/console/UTF-8) with "challenging" content
> Output method (legacy/std::print)
>
> "Environment" properties:
> Locale encoding
> Output redirection in effect
> Console "legacy" encoding
> Console "Unicode API" encoding (including "none")
>
> "Output" properties:
> Mojibake/replacement characters/observations re: redirected output as
> interpreted using various encodings
>
> In the face of this apparent complexity, a presentation of relevant
> analysis would help. Also note that reports of deployment experience should
> probably identify where they land in the space described above.
>
>
>> So are you suggesting that we should do validation for the case when
>> literal encoding is known to be UTF-8?
>>
>
> For the purpose of making the proposal present fewer differences between
> "modes" of behaviour, yes.
>
>
>> Anyway, this question should probably be answered by LEWG or SG16.
>>
>
> Yes.
>
>
>>
>> - Victor
>>
>>
>>
>>
>>
>> On Sat, Mar 13, 2021 at 10:43 AM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Sat, Mar 13, 2021 at 11:36 AM Victor Zverovich <
>>> victor.zverovich_at_[hidden]> wrote:
>>>
>>>> Reply to Tom:
>>>>
>>>> > Should this feature move forward without a parallel proposal to
>>>> provide the underlying implementation-dependent features needed to implement
>>>> std::print()? ... (I believe Victor is already working on a companion
>>>> paper).
>>>>
>>>> Just want to add that this was the main reason for the only SA vote in
>>>> SG16 and I'm indeed working on a separate paper to address this. The latter
>>>> is unnecessary for P2093 but could be useful if users decide to implement
>>>> their own formatted I/O library.
>>>>
>>>> Reply to Hubert:
>>>>
>>>> > Another question is whether the error handling for invalid code unit
>>>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>>>
>>>> I would recommend leaving it to the native API because we won't do
>>>> transcoding in this case and adding extra processing overhead just for
>>>> replacement characters seems undesirable. This is mostly a theoretical
>>>> question though because I am not aware of such an API.
>>>>
>>>> > Strings encoded for the locale will then come from things like user
>>>> input, message catalogs/resource files, the system library, etc. (for
>>>> example, strerror).
>>>>
>>>> I don't think it works in practice with console I/O on Windows as my
>>>> and Tom's experiments have demonstrated because you have multiple
>>>> encodings in play. The assumption that there is one encoding that can be
>>>> determined via the global locale is often incorrect.
>>>>
>>>
>>> Sure, the locale-to-console/terminal encoding mismatch is still in play
>>> (but can be said to be an error on the part of the user of the console
>>> application). Yes, maybe APIs are present to change/bypass the
>>> console/terminal encoding; however, application developers are allowed to
>>> document constraints on the supported operating environment.
>>>
>>>
>>>> That said, P2093 still fully supports legacy encodings in the same way
>>>> printf does (by not doing any transcoding in this case).
>>>>
>>>
>>> P2093 uses a condition (that happens to be true by default when
>>> compiling with Clang for *nix) to determine whether to take strings as
>>> being UTF-8 for std::print. If a native Unicode output interface becomes
>>> attached to the stream (which, if no extra explicit testing is done, is
>>> something that might happen only years after an application was
>>> written/built), P2093 might not be transcoding itself, but it will start
>>> treating things as UTF-8 (possibly leaving the native interface to handle
>>> problems).
>>>
>>>
>>>>
>>>> To clarify: P2093 only attempts to conservatively fix known broken
>>>> cases and not assume any specific encoding otherwise. Therefore
>>>>
>>>> > using only "invariant" characters in string literals is a reasonable
>>>> way to write programs that operate under multiple locales.
>>>>
>>>> continues to be "supported" in the same way it is "supported" by
>>>> current facilities.
>>>>
>>>
>>> I don't think it is quite that conservative (as noted above, it tries to
>>> fix cases where it may be controversial whether things are "broken"). At
>>> the same time, I think it is "too conservative" in a sense. The lack of
>>> UTF-8 encoding validation for output to non-console/non-Unicode capable
>>> streams even when the same stream, should it refer to a Unicode-capable
>>> output device, may have the UTF-8 encoding validation done is a bad design
>>> choice in my book. Especially considering that the Unicode-capability, etc.
>>> detection is currently part of a black box in P2093, I think it is fair to
>>> say that, in the case described above, we're actually expecting the strings
>>> to be UTF-8 (and not really tailored to the specifics of what the stream is
>>> attached to). The "feature" of being able to output non-UTF-8 to an
>>> interface that should rightly be used only with UTF-8 without generating
>>> noticeably bad output (i.e., making things "accidentally work") potentially
>>> hides errors. I'm afraid that less-than-informed adoption will occur
>>> because noticing such errors requires specific testing configurations. I
>>> don't know yet if std::print usage normally imposes a large testing matrix,
>>> but it would be useful to know if there are reasons why it wouldn't.
>>>
>>>
>>>>
>>>> Cheers,
>>>> Victor
>>>>
>>>>
>>>> On Thu, Mar 11, 2021 at 9:33 PM Hubert Tong via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>>>>>
>>>>>> The following are questions/concerns that came up during SG16 review
>>>>>> of P2093 <https://wg21.link/p2093> that are worthy of further
>>>>>> discussion in SG16 and/or LEWG. Most of these issues were discussed in
>>>>>> SG16 and were determined either not to be SG16 concerns or were deemed
>>>>>> issues for which we did not want to hold back forward progress. These
>>>>>> sentiments were not unanimous.
>>>>>>
>>>>>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was
>>>>>> taken during our February 10th telecon. The poll was:
>>>>>>
>>>>>> Poll: Forward P2093R3 to LEWG.
>>>>>> - Attendance: 9
>>>>>>
>>>>>> SF  F  N  A  SA
>>>>>>  4  2  2  0   1
>>>>>>
>>>>>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093>
>>>>>> are available at:
>>>>>>
>>>>>> - December 9th, 2020 telecon
>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>>>>>> review of P2093R2 <https://wg21.link/p2093r2>.
>>>>>> - February 10th, 2021 telecon
>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>>>>>> review of P2093R3 <https://wg21.link/p2093r3>.
>>>>>>
>>>>>> Questions raised include:
>>>>>>
>>>>>> 1. How should errors in transcoding be handled?
>>>>>>    The Unicode recommendation is to substitute a replacement
>>>>>>    character for invalid code unit sequences. P2093R4
>>>>>>    <https://wg21.link/p2093r4> added wording to this effect.
>>>>>>
>>>>>> Another question is whether the error handling for invalid code unit
>>>>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>>>>
>>>>>>
>>>>>> 2. Should this feature move forward without a parallel proposal
>>>>>>    to provide the underlying implementation-dependent features needed
>>>>>>    to implement std::print()?
>>>>>>    Specifically, should this feature be blocked on exposing
>>>>>>    interfaces to 1) determine if a stream is connected directly to a
>>>>>>    terminal/console, and 2) write directly to a terminal/console
>>>>>>    (potentially bypassing a stream) using native interfaces where
>>>>>>    applicable? These features would be necessary in order to implement
>>>>>>    a portable version of std::print(). (I believe Victor is already
>>>>>>    working on a companion paper).
>>>>>>
>>>>>> It is also interesting to ask if "line printers" or other
>>>>> text-oriented output devices should be considered for "direct Unicode
>>>>> output capability" behaviours.
>>>>>
>>>>>>
>>>>>> 3. The choice to base behavior on the compile-time choice of
>>>>>>    execution character set results in locale settings being ignored at
>>>>>>    run-time. Is that ok?
>>>>>>    1. This choice will lead to unexpected results if a program
>>>>>>       runs in a non-UTF-8 locale and consumes non-Unicode input (e.g.,
>>>>>>       from stdin) and then attempts to echo it back.
>>>>>>    2. Additionally, it means that a program that uses only ASCII
>>>>>>       characters in string literals will nevertheless behave
>>>>>>       differently at run-time depending on the choice of execution
>>>>>>       character set (which historically has only affected the encoding
>>>>>>       of string literals).
>>>>>>
>>>>>> My understanding is that the paper is making an assumption that the
>>>>> choice (via the build mode) of using UTF-8 for the execution character set
>>>>> presumed for literals justifies assuming that plain-char strings "in
>>>>> the vicinity" of the output mechanism are UTF-8 encoded. The paper does not
>>>>> seem to have much coverage of how much a user needs to do (or not) to end
>>>>> up with UTF-8 as the execution character set presumed for literals (plus
>>>>> how new/unique/indicative of intent doing so is within a platform
>>>>> ecosystem). I think it tells us that there's a level of opt-in for MSVC
>>>>> users and it is relatively new for the same (at which point, I think having
>>>>> the user be responsible for using UTF-8 locales is rather reasonable). For
>>>>> Clang, it seems the user just ends up with UTF-8 by default (without really
>>>>> asking for it).
>>>>>
>>>>> I believe the design is hard to justify without the assumption I
>>>>> indicated. I am not convinced that the paper presents information that
>>>>> justifies said assumption. Further to what Tom said, using only "invariant"
>>>>> characters in string literals is a reasonable way to write programs that
>>>>> operate under multiple locales. Strings encoded for the locale will then
>>>>> come from things like user input, message catalogs/resource files, the
>>>>> system library, etc. (for example, strerror). It seems that users
>>>>> with a need for non-UTF-8 locales who also want std::print for the
>>>>> convenience factor (and not the Unicode output) might run into problems. If
>>>>> the argument is that we'll all have -fexec-charset by the time this
>>>>> ships and a non-UTF-8 -fexec-charset should work fine for the users
>>>>> in question, then let that argument be made in the paper.
>>>>>
>>>>>
>>>>>> 4. When the execution character set is not UTF-8, should
>>>>>>    conversion to Unicode be performed when writing directly to a
>>>>>>    Unicode-enabled terminal/console?
>>>>>>    1. If so, should conversions be based on the compile-time literal
>>>>>>       encoding or the locale-dependent run-time execution encoding?
>>>>>>    2. If the latter, that creates an odd asymmetry with the
>>>>>>       behavior when the execution character set is UTF-8. Is that ok?
>>>>>> 5. What are the implications for future support of
>>>>>>    std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text",
>>>>>>    u"UTF-16 text", U"UTF-32 text")?
>>>>>>    1. As proposed, std::print() only produces unambiguously
>>>>>>       encoded output when the execution character set is UTF-8, and it
>>>>>>       is clear how these cases should be handled in that case.
>>>>>>    2. But how would the behavior be defined when the execution
>>>>>>       character set is not UTF-8? Would the arguments be converted to
>>>>>>       the execution character set? Or to the locale-dependent encoding?
>>>>>>    3. Note that these concerns are relevant for std::format() as
>>>>>>       well.
>>>>>>
>>>>>> An additional issue that was not discussed in SG16 relates to Unicode
>>>>>> normalization. As proposed, the output will match expectations if
>>>>>> the UTF-8 text does not contain any uses of combining characters. However,
>>>>>> if combining characters are present, either because the text is in NFD or
>>>>>> because there is no precomposed character defined, then the combining
>>>>>> characters may be rendered separately from their base character as a result
>>>>>> of terminal/console interfaces mapping code points rather than grapheme
>>>>>> clusters to columns. Should std::print() also perform NFC
>>>>>> normalization so that characters with precomposed forms are displayed
>>>>>> correctly? (These concerns were explored in P1868
>>>>>> <https://wg21.link/p1868> when it was adopted for C++20; see that
>>>>>> paper for example screenshots; in practice, this is only an issue with the
>>>>>> Windows console).
>>>>>>
>>>>>> It would not be unreasonable for LEWG to send some of these questions
>>>>>> back to SG16 for more analysis.
>>>>>>
>>>>> A question for LEWG: Does the design impose versioning of prebuilt
>>>>> libraries between a UTF-8 build-mode and a non-UTF-8 build mode world?
>>>>>
>>>>>> Tom.
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>
>>>>>
>>>>

Received on 2021-03-20 09:51:21