On Sun, Nov 22, 2020 at 9:33 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

SG16 began reviewing P2093R2 in our recent telecon and will continue review in our next telecon scheduled for December 9th.

The following reflects my personal thoughts on this proposal.

First, I'm excited by this proposal as I think it presents an opportunity to correct some mistakes made in the past. In particular, this is a chance to get character encoding right such that, for the first time, C++ code like the following could be written simply and portably (well, almost; universal-character-names will be required until UTF-8 encoded source files are truly portable, and efforts are under way):

int main() {
  std::print("👋 🌎"); // Hello world in the universal language of emoji (U+1F44B U+1F30E)
}
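Until UTF-8 source files are portable, the same text can be spelled with universal-character-names so that the source file stays pure ASCII. The following is a minimal sketch; the name `greeting` is hypothetical, and the bytes stored in the literal depend on the compiler's literal (execution) encoding (UTF-8 is assumed here, which is the default for GCC and Clang):

```cpp
#include <string_view>

// Hypothetical constant: the emoji greeting written with universal-character-names.
// \U0001F44B is WAVING HAND SIGN and \U0001F30E is EARTH GLOBE AMERICAS.
// Assumption: the literal (execution) encoding is UTF-8, so each emoji
// occupies four bytes in the resulting string.
constexpr std::string_view greeting = "\U0001F44B \U0001F30E";
```

With a UTF-8 literal encoding, the string holds nine bytes: four per emoji plus the space.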

(and yes, I know our mail list archives will mess up the encoding; that is another battle for another day)

I agree with the default output stream being stdout as opposed to std::cout or its associated std::streambuf. The former better preserves compatibility with other stream-oriented formatting facilities; the latter two suffer from private buffering, localization, and conversion services (such services must be integrated or synchronized at a lower level in order for multiple formatting facilities to coexist peacefully). It is clear to me how the proposed interface can be extended to support wchar_t, char8_t, char16_t, and char32_t in the future if output is written directly to a C (or POSIX) stream, but it is not at all clear to me how that could be done correctly for wide C++ streams; char has won when it comes to I/O interfaces and there is no expectation of that changing any time soon.

The paper notes that P1885 would provide an improvement over the is_utf8() method of encoding determination. I agree, but there are multiple ways in which it could be used to provide improvements and I'm not sure which capabilities Victor has in mind (we haven't discussed this in SG16 yet). It could be used as a simple replacement for the is_utf8() implementation. For example:

constexpr bool is_utf8() { return text_encoding::literal() == text_encoding(text_encoding::id::UTF8);}
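For comparison, a literal-encoding check that does not depend on P1885 can be sketched by probing the bytes of a string literal. This is only an illustration in the spirit of an is_utf8() style check; the function name `is_utf8_literal_probe` is hypothetical and the paper's exact code may differ:

```cpp
// Sketch of a literal-encoding probe that needs no P1885 support.
// U+00B5 MICRO SIGN is encoded as the two bytes 0xC2 0xB5 in UTF-8, so the
// array holds those two bytes plus the terminating null when the literal
// (execution) encoding is UTF-8. The name is hypothetical.
constexpr bool is_utf8_literal_probe() {
    const unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}
```

Note that a probe like this only reflects the encoding chosen at compile time; it says nothing about the system (run-time) encoding.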

However, P1885 could also be used to detect the system (run-time) encoding such that output could then be transcoded to match. This is the possibility I alluded to above about this being an opportunity to get character encoding right.
Consider the following program invocations in a Russian Windows environment with a default code page of Windows-1251 where greet is a C++ program compiled so that the execution encoding is UTF-8 (e.g., via the Visual C++ /utf-8 option) and where it writes the Russian greeting (to a Greek friend) example from the paper, std::print("Привет, κόσμος!").

# An invocation that writes to the console.
> greet                    #1
# An invocation that writes to a file.
> greet > file.txt         #2
# An invocation that writes to a pipe.
> greet | grep "Привет"    #3

The desired behavior for #1 is clear; regardless of the system (run-time) encoding, the goal is for the console to display the intended characters. The execution encoding is known. If the console encoding is also known or can be specified, then getting this right is a straightforward case of transcoding to the desired encoding.

The desired behavior for #2 is less clear. The reasonable encoding options are UTF-8 and Windows-1251. The latter will not be able to represent the full output accurately, as some of the Greek characters are not available in Windows-1251 and will therefore be substituted in some way. But if UTF-8 is produced and the next program that reads the file consumes it as Windows-1251, then the accuracy provided by UTF-8 won't matter anyway. Only the user is in a position of knowing what the desired outcome is.

The desired behavior for #3 is more clear. For grep to work as intended, the encodings of the input and the pattern must match, or both must be converted to a common encoding. In the absence of explicit direction, grep must assume that the input and pattern are both encoded as Windows-1251. Assuming UTF-8 is not an option because the command line used for the invocation (that contains the pattern) is Windows-1251 encoded.
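The mismatch in #3 can be illustrated with a byte-wise search, which is a simplification of what grep does. In this sketch the names `privet_utf8`, `privet_cp1251`, and `contains` are hypothetical; the byte sequences are the UTF-8 and Windows-1251 encodings of "Привет":

```cpp
#include <string>

// "Привет" encoded as UTF-8 (two bytes per Cyrillic letter).
const std::string privet_utf8 =
    "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// "Привет" encoded as Windows-1251 (one byte per Cyrillic letter).
const std::string privet_cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Hypothetical stand-in for a byte-oriented pattern match.
bool contains(const std::string& input, const std::string& pattern) {
    return input.find(pattern) != std::string::npos;
}
```

A Windows-1251 pattern from the command line never matches UTF-8 input, even though both spell the same word; the match succeeds only when both sides use the same encoding.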

Reliable and standard interfaces exist to determine when a stream is directed to a terminal/console; POSIX specifies isatty() as noted in the paper. It is generally possible to determine if a stream corresponds to a file or a pipe as well, but that isn't the extent of stream types that exist. There are also sockets, FIFOs, and other arbitrary character devices. I believe it is reasonable to differentiate behavior for a terminal/console, but I think attempting to differentiate behavior for other kinds of streams would be a recipe for surprising and difficult to explain behavior.
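A minimal sketch of such detection on a POSIX system, using isatty() and fstat(). The enumeration `stream_kind` and the function `classify` are hypothetical names introduced for illustration, and Windows would need different interfaces:

```cpp
#include <sys/stat.h>  // POSIX: fstat, S_ISREG, S_ISFIFO, S_ISSOCK
#include <unistd.h>    // POSIX: isatty

// Hypothetical classification of what a file descriptor refers to.
enum class stream_kind { terminal, regular_file, pipe_or_fifo, socket, other };

// Hypothetical helper: classify a POSIX file descriptor.
stream_kind classify(int fd) {
    if (isatty(fd)) return stream_kind::terminal;
    struct stat st;
    if (fstat(fd, &st) != 0) return stream_kind::other;  // e.g. invalid descriptor
    if (S_ISREG(st.st_mode)) return stream_kind::regular_file;
    if (S_ISFIFO(st.st_mode)) return stream_kind::pipe_or_fifo;
    if (S_ISSOCK(st.st_mode)) return stream_kind::socket;
    return stream_kind::other;  // other character devices, block devices, ...
}
```

Only the terminal case has an obviously right answer for output encoding; the remaining cases show how many distinctions an implementation would otherwise have to explain.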

The approach taken in the paper is, if writing to a Unicode capable terminal/console and the literal (execution) encoding is UTF-8, then use native interfaces as necessary to ensure the correct characters appear on the console; otherwise, just write the characters to the stream. This suffices to address #1 above (for the specific case of UTF-8), but it doesn't help with other encodings, nor does it help to improve the situation for #2 or #3.

The model that I believe produces the fewest surprises and is therefore the easiest to use reliably is one in which a program uses whatever internal encoding its programmers prefer (one can think of the execution/literal encoding as the internal encoding) and then transcodes to the system/run-time encoding on I/O boundaries. Thus, a program that uses UTF-8 as the execution/literal encoding would produce Windows-1251 output in the non-terminal/console scenarios described above. This differs from the model described in the paper (UTF-8 output would be produced in that model).
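Transcoding at the I/O boundary can be sketched as follows for the UTF-8 to Windows-1251 case. This is an illustration only: the function names are hypothetical, the mapping table is deliberately partial (ASCII, the basic Cyrillic letters, and Ё/ё), malformed input is handled loosely, and everything else is replaced with '?'. A real implementation would use a complete table or a library such as iconv or ICU.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and advance i.
// Simplified: returns U+FFFD for malformed input and does not reject
// overlong forms or surrogates.
char32_t decode_utf8(std::string_view s, std::size_t& i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i++);
    if (b0 < 0x80) return b0;
    int extra = (b0 >> 5) == 0x06 ? 1 : (b0 >> 4) == 0x0E ? 2 : (b0 >> 3) == 0x1E ? 3 : -1;
    if (extra < 0) return 0xFFFD;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size() || (byte(i) & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (byte(i++) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: transcode UTF-8 to Windows-1251 using a PARTIAL mapping.
// Code points outside the handled subset become '?'.
std::string utf8_to_windows1251_partial(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (cp < 0x80) out += static_cast<char>(cp);                 // ASCII
        else if (cp >= 0x0410 && cp <= 0x044F)                       // А..я
            out += static_cast<char>(0xC0 + (cp - 0x0410));
        else if (cp == 0x0401) out += static_cast<char>(0xA8);       // Ё
        else if (cp == 0x0451) out += static_cast<char>(0xB8);       // ё
        else out += '?';                                             // not handled here
    }
    return out;
}
```

Applied to the greeting from the paper, the Russian word converts cleanly, while Greek letters such as κ fall outside Windows-1251 and are substituted.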

I expect some people reading this to take the position that if it isn't UTF-8 then it is wrong. My response is that mojibake is even more wrong. The unfortunate reality is that there are several important ecosystems that are not yet, and may never be, able to migrate to UTF-8 as the system/run-time encoding. Maintaining a clear separation between internal encoding and external encoding enables correct behavior without having to recompile programs to choose a different literal/execution encoding. As existing ecosystems migrate their system/run-time encoding to UTF-8, programs written in this way will transparently migrate with them.

To make that last point a bit more concrete, consider the greet | grep "Привет" example above. If greet produces UTF-8 in a Windows-1251 environment, then a user will have to explicitly deal with the encoding differences, perhaps by inserting a conversion operation as in greet | iconv -f utf-8 -t windows-1251 | grep "Привет". The problem with this is, if the system/run-time encoding changes in the future, then the explicit conversion will introduce mojibake. Thus, such workarounds become an impediment to UTF-8 migration.

The behavior I want to see adopted for std::print() is:

When writing directly to a terminal/console, exploit native interfaces as necessary for text to be displayed correctly.

Otherwise, write output encoded to match the system/run-time encoding; the encoding that P1885 indicates via text_encoding::system().
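The two rules above amount to a small decision procedure, sketched here. The enumeration `output_action`, the function `choose_output_action`, and the use of plain strings to name encodings are all hypothetical; in practice the system encoding would come from something like text_encoding::system() in P1885.

```cpp
#include <string_view>

// Hypothetical set of actions an implementation could take for one write.
enum class output_action {
    use_native_console_api,  // rule 1: writing directly to a terminal/console
    write_unchanged,         // rule 2, when no conversion is needed
    transcode_to_system      // rule 2, when the encodings differ
};

// Hypothetical decision function for the behavior described above.
// Encodings are identified by name only for the purpose of this sketch.
constexpr output_action choose_output_action(bool writing_to_terminal,
                                             std::string_view literal_encoding,
                                             std::string_view system_encoding) {
    if (writing_to_terminal) return output_action::use_native_console_api;
    if (literal_encoding == system_encoding) return output_action::write_unchanged;
    return output_action::transcode_to_system;
}
```

Under these rules, a UTF-8 program in a Windows-1251 environment transcodes when redirected to a file or pipe, and the same binary writes its output unchanged once the system encoding becomes UTF-8.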

Some other nit-picky items:

Section 6, "Unicode" states that the vprint_unicode() function is exposition-only, but it and vprint_nonunicode() are both present in the proposed wording with no indication of being exposition-only.

Section 6, "Unicode" discusses use of the Visual C++ /utf-8 option. This section is incorrect in stating that both the source and literal (execution) encoding must be UTF-8 for is_utf8() to return true; the source encoding is not relevant. Only the /execution-charset:utf-8 option is needed (the /utf-8 option implies both /source-charset:utf-8 and /execution-charset:utf-8).

It may be worth noting in the paper that overloads could additionally be provided to support writing directly to POSIX file descriptors as is done by the POSIX dprintf() interface (which can be quite useful to write directly to a socket, pipe, or file).
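As a rough idea of what a file descriptor based overload would sit on top of, here is a sketch of writing already-formatted text to a POSIX descriptor. The function `write_all` is a hypothetical helper, not something proposed in the paper:

```cpp
#include <cerrno>
#include <cstddef>
#include <string_view>
#include <unistd.h>  // POSIX: write

// Hypothetical helper: write all of `text` to the POSIX file descriptor `fd`,
// retrying after partial writes and after interruption by a signal.
// Returns false if the write fails or makes no progress.
bool write_all(int fd, std::string_view text) {
    while (!text.empty()) {
        ssize_t n = ::write(fd, text.data(), text.size());
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted; try again
            return false;
        }
        if (n == 0) return false;          // no progress; avoid looping forever
        text.remove_prefix(static_cast<std::size_t>(n));
    }
    return true;
}
```

Because the helper takes a descriptor rather than a FILE*, it works the same way for a socket, pipe, or file.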

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16