On 11/26/20 2:56 PM, Victor Zverovich wrote:
Hi Tom,

Thanks for the detailed feedback.

My pleasure, thank you for the paper!

> it is not at all clear to me how that could be done correctly for wide C++ streams

AFAICS wide streams are virtually unused nowadays. For example codesearch gives 114 matches for fwide (https://codesearch.isocpp.org/cgi-bin/cgi_ppsearch?q=fwide&search=Search) and those are mostly in standard library implementations. In any case, I think this belongs in a separate paper.

I think I was unclear.  My intent here was to note that, if C++ streams were used as the default output, it is unclear to me how a hypothetical std::wprint() would work.  This was intended as another point in favor of using C streams.  I agree that this is subject matter for a different paper.

> The desired behavior for #2 is less clear.

I think it's clear that the encoding should be UTF-8 in this case because using legacy CP1251 would cause loss of data and won't solve mojibake as I'll demonstrate below. Note that CP1251 is hardly ever used as a file encoding even on Windows - I know this because I actually was a Russian Windows user in the past. For example, usage of this encoding for websites dropped from 4.3% to 0.9% in the last 10 years (https://w3techs.com/technologies/history_overview/character_encoding/ms/y) and continues to drop. Putting aside the web, if you look at a Windows application that works with text such as Notepad, you'll notice that even with Russian localization UTF-8 is the default:

image.png

Notepad doesn't even list CP1251 explicitly as an encoding option! You can specify the "ANSI" encoding which will give you CP1251 if you happen to run Windows in Russian (or another language where CP1251 is the default codepage). However, this won't be compatible with the terminal encoding which is CP866 (https://en.wikipedia.org/wiki/Code_page_866). So if you try to display a file written in CP1251 in a terminal (or on a Windows system with a different codepage) you'll get mojibake:

image.png

My choice of Windows-1251 for an example scenario was motivated solely by the use of Russian characters in the example in the paper.  The concerns apply equally to any of the other Windows code pages, or to EBCDIC in other environments.

Yes, Notepad switched to UTF-8 as the default encoding last year (and my corporate laptop just received the OS update that includes that change two days ago!)  I don't think the Notepad example is particularly relevant.  There is no particular expectation that a .txt file was produced by a program running on the local machine, so the local code page isn't a particularly good default in any case.

We agree on working around the issues with the Windows console encoding by bypassing it via direct writes to native console interfaces.  The paper proposes doing that bypass only for UTF-8, but I would like to see that done when the execution encoding is non-UTF-8 as well.  If I write a version of the Windows 'type' command as you used it above, call it 'cat', compile it without Microsoft's /utf-8 option, then I would like it to still do the right thing; not the behavior you illustrated above.


You would also pay a performance penalty of transcoding to make this data loss and mojibake possible.

That is true only if the execution/literal encoding and the run-time encoding do not match (I'm ignoring the overhead imposed by checking that they match; since I/O is involved here, I doubt that overhead would be measurable).

While it is clear that #2 should be UTF-8, #3 is slightly less obvious. However there are two observations here:

1. We won't solve the mojibake problem by switching from UTF-8 to CP1251.
2. Most files are already in UTF-8 so the interoperability with programs using legacy codepages is already there, e.g. you'll get the same problem if you do `grep test.txt`.

I think the best we can do in #3 is to be consistent with common application defaults and use UTF-8 when the user asks for it (with /utf-8 or some other mechanism) and not try doing a magic transcoding since the latter won't work anyway and will only cause data loss and a performance penalty.

I think it would be useful if the paper summarized the encoding behavior for the surveyed print statements in section 5.

At least some of them behave as I proposed.  For example, the 'print()' methods of Java's 'PrintStream' class transcode to the locale-sensitive run-time encoding by default ('java.lang.System.out' can be modified to point to a 'PrintStream' instance explicitly created to target UTF-8).  Likewise, C#'s 'Console.Write()' transcodes to the active code page by default ('Console.OutputEncoding' can be used to change the default encoding).  Similarly for Perl and Python (2 and 3).

C, Fortran, Go, and Rust all write bytes.  For Go and Rust, that means de facto UTF-8 since string literals are UTF-8.

I haven't been able to find good documentation for Swift.


> To make that last point a bit more concrete, consider the greet | grep "Привет" example above.  If greet produces UTF-8 in a Windows-1251 environment, then a user will have to explicitly deal with the encoding differences,

Unfortunately your example won't work with CP1251 either:

image.png
(I used findstr since grep is uncommon in Windows but the idea is the same.)

Assuming test.txt is UTF-8 encoded, that is correct; this is a straightforward case of mojibake.  I believe that the file vs pipe distinction is an important one; I don't think the encoding implications are the same in both cases.

> Some other nit-picky items

I'll address them in the next revision, thanks!

Thanks!

A few other items that I would like to see the paper discuss:

Writing bytes.  For example, filenames.  I don't think there is a right answer for filenames; some valid filenames cannot be displayed accurately in any well-formed encoding.  Perhaps a 'formatter' specialization should be provided for std::filesystem::path?  Proposing something like that is likely subject matter for a different paper, but I think it would be helpful for this paper to discuss it.

Transcoding errors.  The Windows native console interface requires UTF-16 (assuming use of WriteConsoleW() as I believe is used by FMT).  That means transcoding the std::print() input from UTF-8 to UTF-16.  What happens if the UTF-8 input is ill-formed?
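One possible answer to the ill-formed input question, sketched below under stated assumptions: substitute U+FFFD (REPLACEMENT CHARACTER) rather than fail.  The function name utf8_to_utf16_lossy is hypothetical, and the policy of emitting one U+FFFD per rejected byte is a simplification chosen for brevity; it is not taken from the paper or from {fmt}.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2093 or {fmt}): converts UTF-8 to UTF-16 and
// emits U+FFFD for every byte that cannot start a well-formed sequence.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    char32_t min_cp = 0;  // smallest code point allowed for this length
    std::size_t len = 0;  // 0 means "invalid lead byte"
    if (lead < 0x80)                       { cp = lead;        min_cp = 0;       len = 1; }
    else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; min_cp = 0x80;    len = 2; }
    else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; min_cp = 0x800;   len = 3; }
    else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; min_cp = 0x10000; len = 4; }

    // Check that the sequence is complete and that each trailing byte is 10xxxxxx.
    bool ok = len != 0 && i + len <= in.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      const unsigned char cont = static_cast<unsigned char>(in[i + k]);
      if ((cont & 0xC0) != 0x80) ok = false;
      else cp = (cp << 6) | (cont & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;

    if (!ok) {
      out.push_back(u'\uFFFD');
      ++i;  // resynchronize on the next byte
      continue;
    }
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000;  // encode as a surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  return out;
}
```

The resulting buffer could then be handed to a UTF-16 console interface.  Other policies are equally plausible (throwing, or writing the offending bytes unmodified to the stream), which is why I would like the paper to state which one is intended.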

Tom.


Cheers,
Victor

On Sun, Nov 22, 2020 at 9:33 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

SG16 began reviewing P2093R2 in our recent telecon and will continue review in our next telecon scheduled for December 9th.

The following reflects my personal thoughts on this proposal.

First, I'm excited by this proposal as I think it presents an opportunity to correct for some mistakes made in the past.  In particular, this is a chance to get character encoding right such that, for the first time, C++ code like the following could be written simply and portably (well, almost, use of universal-character-names will be required until UTF-8 encoded source files are truly portable; efforts are under way):

int main() {
  std::print("👋 🌎"); // Hello world in the universal language of emoji (U+1F44B U+1F30E)
}

(and yes, I know our mail list archives will mess up the encoding; that is another battle for another day)

I agree with the default output stream being stdout as opposed to std::cout or its associated std::streambuf.  The former better preserves compatibility with other stream oriented formatting facilities; the latter two suffer from private buffering, localization, and conversion services (such services must be integrated or synchronized at a lower level in order for multiple formatting facilities to coexist peacefully).  It is clear to me how the proposed interface can be extended to support wchar_t, char8_t, char16_t, and char32_t in the future if output is written directly to a C (or POSIX) stream, but it is not at all clear to me how that could be done correctly for wide C++ streams; char has won when it comes to I/O interfaces and there is no expectation of that changing any time soon.

The paper notes that P1885 would provide an improvement over the is_utf8() method of encoding determination.  I agree, but there are multiple ways in which it could be used to provide improvements and I'm not sure which capabilities Victor has in mind (we haven't discussed this in SG16 yet).  It could be used as a simple replacement for the is_utf8() implementation.  For example:

constexpr bool is_utf8() {
  return text_encoding::literal() == text_encoding(text_encoding::id::UTF8);
}

However, P1885 could also be used to detect the system (run-time) encoding such that output could then be transcoded to match.  This is the possibility I alluded to above about this being an opportunity to get character encoding right.

Consider the following program invocations in a Russian Windows environment with a default code page of Windows-1251 where greet is a C++ program compiled so that the execution encoding is UTF-8 (e.g., via the Visual C++ /utf-8 option) and where it writes the Russian greeting (to a Greek friend) example from the paper, std::print("Привет, κόσμος!").

# An invocation that writes to the console.
> greet                  #1

# An invocation that writes to a file.
> greet > file.txt       #2

# An invocation that writes to a pipe.
> greet | grep "Привет"  #3

The desired behavior for #1 is clear; regardless of the system (run-time) encoding, the goal is for the console to display the intended characters.  The execution encoding is known.  If the console encoding is also known or can be specified, then getting this right is a straightforward case of transcoding to the desired encoding.

The desired behavior for #2 is less clear.  The reasonable encoding options are UTF-8 and Windows-1251.  Both are defensible, but the latter will not be able to represent the full output accurately as some of the Greek characters are not available in Windows-1251 and will therefore be substituted in some way.  But if UTF-8 is produced and the next program that reads the file consumes it as Windows-1251, then the accuracy provided by UTF-8 won't matter anyway.  Only the user is in a position of knowing what the desired outcome is.
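To illustrate the substitution, here is a minimal sketch of a lossy encoder.  It is deliberately incomplete: the hypothetical to_windows_1251_lossy() handles only ASCII and the basic Russian alphabet, and it replaces everything else with '?'.  A real implementation would use a full mapping table or a platform conversion service, and its substitution choices may differ.

```cpp
#include <string>
#include <string_view>

// Illustrative only: encodes code points as Windows-1251.  Covers ASCII,
// U+0410..U+044F (0xC0..0xFF), and the letters U+0401/U+0451; any other
// code point is replaced with '?'.
std::string to_windows_1251_lossy(std::u32string_view text) {
  std::string out;
  for (char32_t cp : text) {
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp >= 0x0410 && cp <= 0x044F) {
      out.push_back(static_cast<char>(cp - 0x0410 + 0xC0));
    } else if (cp == 0x0401) {
      out.push_back(static_cast<char>(0xA8));
    } else if (cp == 0x0451) {
      out.push_back(static_cast<char>(0xB8));
    } else {
      out.push_back('?');  // not representable in this sketch
    }
  }
  return out;
}
```

Passing the greeting from the paper, written with universal-character-names as U"\u041F\u0440\u0438\u0432\u0435\u0442, \u03BA\u03CC\u03C3\u03BC\u03BF\u03C2!", keeps the six Russian letters (bytes CF F0 E8 E2 E5 F2) and turns each of the six Greek letters into '?'.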

The desired behavior for #3 is more clear.  For grep to work as intended, the encoding of the input and the pattern must match or both must be converted to a common encoding.  In the absence of explicit direction, grep must assume that the input and pattern are both encoded as Windows-1251.  Assuming UTF-8 is not an option because the command line used for the invocation (that contains the pattern) is Windows-1251 encoded.

Reliable and standard interfaces exist to determine when a stream is directed to a terminal/console; POSIX specifies isatty() as noted in the paper.  It is generally possible to determine if a stream corresponds to a file or a pipe as well, but that isn't the extent of stream types that exist.  There are also sockets, FIFOs, and other arbitrary character devices.  I believe it is reasonable to differentiate behavior for a terminal/console, but I think attempting to differentiate behavior for other kinds of streams would be a recipe for surprising and difficult to explain behavior.
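As a minimal sketch of the terminal check on a POSIX system (the helper name is_terminal is hypothetical):

```cpp
#include <stdio.h>   // FILE, fileno() (POSIX)
#include <unistd.h>  // isatty() (POSIX)

// Hypothetical helper: true when the C stream is attached to a terminal.
// isatty() returns 1 for a terminal and 0 otherwise (setting errno).
bool is_terminal(FILE* stream) {
  return isatty(fileno(stream)) == 1;
}
```

With this, is_terminal(stdout) would be true for invocation #1 and false for #2 and #3.  On Windows, the analogous calls are _isatty() and _fileno().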

The approach taken in the paper is, if writing to a Unicode capable terminal/console and the literal (execution) encoding is UTF-8, then use native interfaces as necessary to ensure the correct characters appear on the console; otherwise, just write the characters to the stream.  This suffices to address #1 above (for the specific case of UTF-8), but it doesn't help with other encodings, nor does it help to improve the situation for #2 or #3.

The model that I believe produces the fewest surprises and is therefore the easiest to use reliably is one in which a program uses whatever internal encoding its programmers prefer (one can think of the execution/literal encoding as the internal encoding) and then transcodes to the system/run-time encoding on I/O boundaries.  Thus, a program that uses UTF-8 as the execution/literal encoding would produce Windows-1251 output in the non-terminal/console scenarios described above.  This differs from the model described in the paper (UTF-8 output would be produced in that model).

I expect some people reading this to take the position that if it isn't UTF-8 then it is wrong.  My response is that mojibake is even more wrong.  The unfortunate reality is that there are several important ecosystems that are not yet, and may never be, able to migrate to UTF-8 as the system/run-time encoding.  Maintaining a clear separation between internal encoding and external encoding enables correct behavior without having to recompile programs to choose a different literal/execution encoding.  As existing ecosystems migrate their system/run-time encoding to UTF-8, programs written in this way will transparently migrate with them.

To make that last point a bit more concrete, consider the greet | grep "Привет" example above.  If greet produces UTF-8 in a Windows-1251 environment, then a user will have to explicitly deal with the encoding differences, perhaps by inserting a conversion operation as in greet | iconv -f utf-8 -t windows-1251 | grep "Привет".  The problem with this is, if the system/run-time encoding changes in the future, then the explicit conversion will introduce mojibake.  Thus, such workarounds become an impediment to UTF-8 migration.

The behavior I want to see adopted for std::print() is:

  1. When writing directly to a terminal/console, exploit native interfaces as necessary for text to be displayed correctly.
  2. Otherwise, write output encoded to match the system/run-time encoding; the encoding that P1885 indicates via text_encoding::system().
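The two rules can be sketched as a small decision function.  Everything here is hypothetical and for illustration only: the enum, the function name, and the use of plain encoding names in place of P1885 text_encoding objects.  The Passthrough case reflects that no transcoding is needed when the literal and system encodings already match.

```cpp
#include <string_view>

// Hypothetical classification of how a std::print()-like facility would
// deliver its output under the behavior described above.
enum class OutputPath {
  NativeConsole,  // rule 1: use native console interfaces
  Passthrough,    // rule 2, encodings match: write the bytes unchanged
  Transcode       // rule 2, encodings differ: convert to the system encoding
};

// Encodings are passed as names purely to keep the sketch self-contained.
constexpr OutputPath choose_output_path(bool is_console,
                                        std::string_view literal_encoding,
                                        std::string_view system_encoding) {
  if (is_console) return OutputPath::NativeConsole;
  if (literal_encoding == system_encoding) return OutputPath::Passthrough;
  return OutputPath::Transcode;
}
```

For the greet example, invocation #1 would take the NativeConsole path, while #2 and #3 would take the Transcode path (UTF-8 literal encoding, Windows-1251 system encoding).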

Some other nit-picky items:

  • Section 6, "Unicode" states that the vprint_unicode() function is exposition-only, but it and vprint_nonunicode() are both present in the proposed wording with no indication of being exposition-only.
  • Section 6, "Unicode" discusses use of the Visual C++ /utf-8 option.  This section is incorrect in stating that both the source and literal (execution) encoding must be UTF-8 for is_utf8() to return true; the source encoding is not relevant.  Only the /execution-charset:utf-8 option is needed (the /utf-8 option implies both /source-charset:utf-8 and /execution-charset:utf-8).
  • It may be worth noting in the paper that overloads could additionally be provided to support writing directly to POSIX file descriptors as is done by the POSIX dprintf() interface (which can be quite useful to write directly to a socket, pipe, or file).

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16