SG16

Subject: Re: Questions for LEWG for P2093R4: Formatted output
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2021-04-01 15:48:00


On Sat, Mar 20, 2021 at 10:51 AM Victor Zverovich <
victor.zverovich_at_[hidden]> wrote:

> > A user with a console/terminal encoding that matches their locale
> encoding running a program that outputs a matching non-UTF-8 locale-encoded
> string, would, for a program compiled with UTF-8 for the encoding of string
> literals, "gain" replacement characters on a Unicode (but not UTF-8,
> because then we just don't know what happens) capable terminal. The same
> user (with the same build configuration and runtime environment) gets the
> locale-encoded string displayed correctly using the "existing facilities".
>
> OK, let's test this hypothesis.
>
> Here is a simple test program:
>
> #include <locale.h>
> #include <stdio.h>
> #include <stdlib.h>
>

#include <string.h>

>
> int main() {
> setlocale(LC_ALL, "Russian_Russia.866");
> const char* message = "Привет, мир!\n";
> if ((unsigned char)message[0] != 0x8F) {
>

if (memcmp(message, "\x8F\xE0\xA8\xA2\xA5\xE2\x2C\x20\xAC\xA8\xE0\x21", 12)
!= 0) {

> puts("wrong encoding");
> abort();
> }
> printf("%s", message);
> }
>
> It sets the locale encoding to the console encoding (CP866 in this case)
> and uses a "legacy" printf to print the message. Note that it took some
> effort to write this program because Notepad doesn't even support CP866!
>
> Let's compile it:
>
> >cl /utf-8 test-ru.c
>

Well, I certainly didn't say to compile a file that is not validly encoded
for the source encoding that the compiler is using. Also, compilers have
been known to have options that separately specify the source encoding and
the encoding used for literals.

> ...
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x8d that is illegal
> in the current source character set (codepage 65001).
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x91 that is illegal
> in the current source character set (codepage 65001).
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x92 that is illegal
> in the current source character set (codepage 65001).
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x95 that is illegal
> in the current source character set (codepage 65001).
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x96 that is illegal
> in the current source character set (codepage 65001).
> test-ru.c(1): warning C4828: The file contains a character starting at
> offset 0x97 that is illegal
> in the current source character set (codepage 65001).
> ...
>
> and run:
>
> >test-ru
> Привет, ми!\n
>
> It almost works, but instead of "Hello, world!" we get "Hello, E!" (where E
> is the musical note). So I'd argue that this is worse than other forms of
> mojibake because it works for some inputs but not others.
>

I don't think this observation is useful because the test is not reflective
of a valid baseline (i.e., a case where the status quo works) for
comparison.

> One might argue that the string comes from a file and not a literal, but in
> that case it will most likely be either UTF-8- or ACP-encoded, which may
> differ from the console encoding. We then may or may not get mojibake
> depending on whether the ACP and the console encoding match, which I think
> is a terrible API.
>

Sure, it may be a terrible API, but it's the status quo deployment for some
people. This demonstration seems to be missing the part where std::print
puts out replacement characters in place of most of the string (and, if the
string came from an ACP-encoded file, all of it). It is understood that
std::print with a string that came from a UTF-8-encoded file will also print
all of the string, but then, if there was a mix of ACP-encoded printf and
UTF-8 std::print, redirection can lead to a file with mixed encoding.

>
> Moreover, the locale encoding is completely irrelevant here because printf
> doesn't take it into account at all.
>
> Cheers,
> Victor
>
>
>
> On Sun, Mar 14, 2021 at 11:08 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Sun, Mar 14, 2021 at 11:13 AM Victor Zverovich <
>> victor.zverovich_at_[hidden]> wrote:
>>
>>> > If a native Unicode output interface becomes attached to the stream
>>>
>>> What interface are you referring to? To the best of my knowledge there
>>> is no such interface on POSIX so neither P2093 will do transcoding, nor
>>> errors will be reported by the native interface in this case.
>>>
>>
>> None that I am aware of in particular. However, extending terminfo, etc.
>> so that such an interface becomes available in the future could not be
>> discounted as a possibility.
>>
>>
>>>
>>> > The lack of UTF-8 encoding validation for output to
>>> non-console/non-Unicode-capable streams, even when the same stream, should
>>> it refer to a Unicode-capable output device, may have the UTF-8 encoding
>>> validation done, is a bad design choice in my book.
>>>
>>> In general I would agree but here we are trying to explicitly avoid
>>> validation except for the only case where it is neither avoidable nor
>>> programmatically detectable, at least when using replacement characters.
>>> The only effect is that the user will see invalid sequences replaced by
>>> something else on the console. There is just a small improvement in user
>>> experience compared to existing facilities because instead of mojibake they
>>> would get replacement characters.
>>>
>>
>> A user with a console/terminal encoding that matches their locale
>> encoding running a program that outputs a matching non-UTF-8 locale-encoded
>> string, would, for a program compiled with UTF-8 for the encoding of string
>> literals, "gain" replacement characters on a Unicode (but not UTF-8,
>> because then we just don't know what happens) capable terminal. The same
>> user (with the same build configuration and runtime environment) gets the
>> locale-encoded string displayed correctly using the "existing facilities".
>> They will also get the locale-encoded string displayed correctly if their
>> terminal doesn't report as being Unicode capable (either because it isn't
>> or because the various libraries in their operating environment are not
>> capable of taking advantage of the terminal's Unicode capability for now).
>>
>> So, things can look like they are fine when people start adopting
>> std::print (even if they really aren't fine).
>>
>> I think the following dimensions are relevant in evaluating the proposal:
>>
>> "Build" properties:
>> Encoding used for string literals
>>
>> "Source" properties:
>> Actual string encoding (locale/console/UTF-8) with "challenging" content
>> Output method (legacy/std::print)
>>
>> "Environment" properties:
>> Locale encoding
>> Output redirection in effect
>> Console "legacy" encoding
>> Console "Unicode API" encoding (including "none")
>>
>> "Output" properties:
>> Mojibake/replacement characters/observations re: redirected output as
>> interpreted using various encodings
>>
>> In the face of this apparent complexity, a presentation of relevant
>> analysis would help. Also note that reports of deployment experience should
>> probably identify where they land in the space described above.
>>
>>
>>> So are you suggesting that we should do validation for the case when
>>> literal encoding is known to be UTF-8?
>>>
>>
>> For the purpose of making the proposal present fewer differences between
>> "modes" of behaviour, yes.
>>
>>
>>> Anyway, this question should probably be answered by LEWG or SG16.
>>>
>>
>> Yes.
>>
>>
>>>
>>> - Victor
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Mar 13, 2021 at 10:43 AM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Sat, Mar 13, 2021 at 11:36 AM Victor Zverovich <
>>>> victor.zverovich_at_[hidden]> wrote:
>>>>
>>>>> Reply to Tom:
>>>>>
>>>>> > Should this feature move forward without a parallel proposal to
>>>>> provide the underlying implementation-dependent features needed to implement
>>>>> std::print()? ... (I believe Victor is already working on a
>>>>> companion paper).
>>>>>
>>>>> Just want to add that this was the main reason for the only SA vote in
>>>>> SG16 and I'm indeed working on a separate paper to address this. The latter
>>>>> is unnecessary for P2093 but could be useful if users decide to implement
>>>>> their own formatted I/O library.
>>>>>
>>>>> Reply to Hubert:
>>>>>
>>>>> > Another question is whether the error handling for invalid code unit
>>>>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>>>>
>>>>> I would recommend leaving it to the native API because we won't do
>>>>> transcoding in this case and adding extra processing overhead just for
>>>>> replacement characters seems undesirable. This is mostly a theoretical
>>>>> question though because I am not aware of such API.
>>>>>
>>>>> > Strings encoded for the locale will then come from things like user
>>>>> input, message catalogs/resource files, the system library, etc. (for
>>>>> example, strerror).
>>>>>
>>>>> I don't think it works in practice with console I/O on Windows as my
>>>>> and Tom's experiments have demonstrated because you have multiple
>>>>> encodings in play. Assumption that there is one encoding that can be
>>>>> determined via the global locale is often incorrect.
>>>>>
>>>>
>>>> Sure, the locale-to-console/terminal encoding mismatch is still in play
>>>> (but can be said to be an error on the part of the user of the console
>>>> application). Yes, maybe APIs are present to change/bypass the
>>>> console/terminal encoding; however, application developers are allowed to
>>>> document constraints on the supported operating environment.
>>>>
>>>>
>>>>> That said, P2093 still fully supports legacy encodings in the same way
>>>>> printf does (by not doing any transcoding in this case).
>>>>>
>>>>
>>>> P2093 uses a condition (that happens to be true by default when
>>>> compiling with Clang for *nix) to determine whether to take strings as
>>>> being UTF-8 for std::print. If a native Unicode output interface becomes
>>>> attached to the stream (which, if no extra explicit testing is done, is
>>>> something that might happen only years after an application was
>>>> written/built), P2093 might not be transcoding itself, but it will start
>>>> treating things as UTF-8 (possibly leaving the native interface to handle
>>>> problems).
>>>>
>>>>
>>>>>
>>>>> To clarify: P2093 only attempts to conservatively fix known broken
>>>>> cases and not assume any specific encoding otherwise. Therefore
>>>>>
>>>>> > using only "invariant" characters in string literals is a reasonable
>>>>> way to write programs that operate under multiple locales.
>>>>>
>>>>> continues to be "supported" in the same way it is "supported" by
>>>>> current facilities.
>>>>>
>>>>
>>>> I don't think it is quite that conservative (as noted above, it tries
>>>> to fix cases where it may be controversial whether things are "broken"). At
>>>> the same time, I think it is "too conservative" in a sense. The lack of
>>>> UTF-8 encoding validation for output to non-console/non-Unicode-capable
>>>> streams, even when the same stream, should it refer to a Unicode-capable
>>>> output device, may have the UTF-8 encoding validation done, is a bad design
>>>> choice in my book. Especially considering that the Unicode-capability, etc.
>>>> detection is currently part of a black box in P2093, I think it is fair to
>>>> say that, in the case described above, we're actually expecting the strings
>>>> to be UTF-8 (and not really tailored to the specifics of what the stream is
>>>> attached to). The "feature" of being able to output non-UTF-8 to an
>>>> interface that should rightly be used only with UTF-8 without generating
>>>> noticeably bad output (i.e., making things "accidentally work") potentially
>>>> hides errors. I'm afraid that less-than-informed adoption will occur
>>>> because noticing such errors requires specific testing configurations. I
>>>> don't know yet if std::print usage normally imposes a large testing matrix,
>>>> but it would be useful to know if there are reasons why it wouldn't.
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Victor
>>>>>
>>>>>
>>>>> On Thu, Mar 11, 2021 at 9:33 PM Hubert Tong via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> On Thu, Mar 11, 2021 at 12:26 AM Tom Honermann via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> std::print("╟≥σσ⌠Θετ≤ ßεΣ πß∞⌡⌠ß⌠Θ∩επ!\n");
>>>>>>>
>>>>>>> The following are questions/concerns that came up during SG16 review
>>>>>>> of P2093 <https://wg21.link/p2093> that are worthy of further
>>>>>>> discussion in SG16 and/or LEWG. Most of these issues were discussed in
>>>>>>> SG16 and were determined either not to be SG16 concerns or were deemed
>>>>>>> issues for which we did not want to hold back forward progress. These
>>>>>>> sentiments were not unanimous.
>>>>>>>
>>>>>>> The SG16 poll to forward P2093R3 <https://wg21.link/p2093r3> was
>>>>>>> taken during our February 10th telecon. The poll was:
>>>>>>>
>>>>>>> Poll: Forward P2093R3 to LEWG.
>>>>>>> - Attendance: 9
>>>>>>> SF: 4, F: 2, N: 2, A: 0, SA: 1
>>>>>>>
>>>>>>> Minutes for prior SG16 reviews of P2093 <https://wg21.link/p2093>,
>>>>>>> are available at:
>>>>>>>
>>>>>>> - December 9th, 2020 telecon
>>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#december-9th-2020>;
>>>>>>> review of P2093R2 <https://wg21.link/p2093r2>.
>>>>>>> - February 10th, 2021 telecon
>>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md>;
>>>>>>> review of P2093R3 <https://wg21.link/p2093r3>.
>>>>>>>
>>>>>>> Questions raised include:
>>>>>>>
>>>>>>> 1. How should errors in transcoding be handled?
>>>>>>> The Unicode recommendation is to substitute a replacement
>>>>>>> character for invalid code unit sequences. P2093R4
>>>>>>> <https://wg21.link/p2093r4> added wording to this effect.
>>>>>>>
>>>>>>> Another question is whether the error handling for invalid code unit
>>>>>> sequences should be left to the native Unicode API if it accepts UTF-8.
>>>>>>
>>>>>>>
>>>>>>> 2. Should this feature move forward without a parallel proposal
>>>>>>> to provide the underlying implementation-dependent features needed to
>>>>>>> implement std::print()?
>>>>>>> Specifically, should this feature be blocked on exposing
>>>>>>> interfaces to 1) determine if a stream is connected directly to a
>>>>>>> terminal/console, and 2) write directly to a terminal/console (potentially
>>>>>>> bypassing a stream) using native interfaces where applicable? These
>>>>>>> features would be necessary in order to implement a portable version of
>>>>>>> std::print(). (I believe Victor is already working on a
>>>>>>> companion paper).
>>>>>>>
>>>>>>> It is also interesting to ask if "line printers" or other
>>>>>> text-oriented output devices should be considered for "direct Unicode
>>>>>> output capability" behaviours.
>>>>>>
>>>>>>>
>>>>>>> 3. The choice to base behavior on the compile-time choice of
>>>>>>> execution character set results in locale settings being ignored at
>>>>>>> run-time. Is that ok?
>>>>>>> 1. This choice will lead to unexpected results if a program
>>>>>>> runs in a non-UTF-8 locale and consumes non-Unicode input (e.g., from
>>>>>>> stdin) and then attempts to echo it back.
>>>>>>> 2. Additionally, it means that a program that uses only ASCII
>>>>>>> characters in string literals will nevertheless behave differently at
>>>>>>> run-time depending on the choice of execution character set (which
>>>>>>> historically has only affected the encoding of string literals).
>>>>>>>
>>>>>>> My understanding is that the paper is making an assumption that the
>>>>>> choice (via the build mode) of using UTF-8 for the execution character set
>>>>>> presumed for literals justifies assuming that plain-char strings "in
>>>>>> the vicinity" of the output mechanism are UTF-8 encoded. The paper does not
>>>>>> seem to have much coverage of how much a user needs to do (or not) to end
>>>>>> up with UTF-8 as the execution character set presumed for literals (plus
>>>>>> how new/unique/indicative of intent doing so is within a platform
>>>>>> ecosystem). I think it tells us that there's a level of opt-in for MSVC
>>>>>> users and it is relatively new for the same (at which point, I think having
>>>>>> the user be responsible for using UTF-8 locales is rather reasonable). For
>>>>>> Clang, it seems the user just ends up with UTF-8 by default (without really
>>>>>> asking for it).
>>>>>>
>>>>>> I believe the design is hard to justify without the assumption I
>>>>>> indicated. I am not convinced that the paper presents information that
>>>>>> justifies said assumption. Further to what Tom said, using only "invariant"
>>>>>> characters in string literals is a reasonable way to write programs that
>>>>>> operate under multiple locales. Strings encoded for the locale will then
>>>>>> come from things like user input, message catalogs/resource files, the
>>>>>> system library, etc. (for example, strerror). It seems that users
>>>>>> with a need for non-UTF-8 locales who also want std::print for the
>>>>>> convenience factor (and not the Unicode output) might run into problems. If
>>>>>> the argument is that we'll all have -fexec-charset by the time this
>>>>>> ships and a non-UTF-8 -fexec-charset should work fine for the users
>>>>>> in question, then let that argument be made in the paper.
>>>>>>
>>>>>>
>>>>>>> 4. When the execution character set is not UTF-8, should
>>>>>>> conversion to Unicode be performed when writing directly to a Unicode
>>>>>>> enabled terminal/console?
>>>>>>> 1. If so, should conversions be based on the compile-time
>>>>>>> literal encoding or the locale dependent run-time execution encoding?
>>>>>>> 2. If the latter, that creates an odd asymmetry with the
>>>>>>> behavior when the execution character set is UTF-8. Is that ok?
>>>>>>> 5. What are the implications for future support of std::print("{}
>>>>>>> {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text")
>>>>>>> ?
>>>>>>> 1. As proposed, std::print() only produces unambiguously
>>>>>>> encoded output when the execution character set is UTF-8 and it is clear
>>>>>>> how these cases should be handled in that case.
>>>>>>> 2. But how would the behavior be defined when the execution
>>>>>>> character set is not UTF-8? Would the arguments be converted to the
>>>>>>> execution character set? Or to the locale dependent encoding?
>>>>>>> 3. Note that these concerns are relevant for std::format() as
>>>>>>> well.
>>>>>>>
>>>>>>> An additional issue that was not discussed in SG16 relates to
>>>>>>> Unicode normalization. As proposed, the expected output will match
>>>>>>> expectations if the UTF-8 text does not contain any uses of combining
>>>>>>> characters. However, if combining characters are present, either because
>>>>>>> the text is in NFD or because there is no precomposed character defined,
>>>>>>> then the combining characters may be rendered separately from their base
>>>>>>> character as a result of terminal/console interfaces mapping code points
>>>>>>> rather than grapheme clusters to columns. Should std::print() also
>>>>>>> perform NFC normalization so that characters with precomposed forms are
>>>>>>> displayed correctly? (These concerns were explored in P1868
>>>>>>> <https://wg21.link/p1868> when it was adopted for C++20; see that
>>>>>>> paper for example screenshots; in practice, this is only an issue with the
>>>>>>> Windows console).
>>>>>>>
>>>>>>> It would not be unreasonable for LEWG to send some of these
>>>>>>> questions back to SG16 for more analysis.
>>>>>>>
>>>>>> A question for LEWG: Does the design impose versioning of prebuilt
>>>>>> libraries between a UTF-8 build-mode and a non-UTF-8 build mode world?
>>>>>>
>>>>>>> Tom.
>>>>>>> --
>>>>>>> SG16 mailing list
>>>>>>> SG16_at_[hidden]
>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>
>>>>>



SG16 list run by sg16-owner@lists.isocpp.org