Subject: Review of P2093R2: Formatted output
From: Tom Honermann (tom_at_[hidden])
Date: 2020-11-22 23:33:00


SG16 began reviewing P2093R2 <https://wg21.link/p2093r2> in our recent
telecon
<https://github.com/sg16-unicode/sg16-meetings#november-11th-2020> and
will continue review in our next telecon scheduled for December 9th.

The following reflects my personal thoughts on this proposal.

First, I'm excited by this proposal as I think it presents an
opportunity to correct for some mistakes made in the past.  In
particular, this is a chance to get character encoding right such that,
for the first time, C++ code like the following could be written simply
and portably (well, almost, use of universal-character-names will be
required until UTF-8 encoded source files are truly portable; efforts
are under way):

    int main() {
        // Hello world in the universal language of emoji (U+1F44B U+1F30E)
        std::print("👋 🌎");
    }

(and yes, I know our mail list archives will mess up the encoding; that
is another battle for another day)

I agree with the default output stream being stdout as opposed to
std::cout or its associated std::streambuf. The former better preserves
compatibility with other stream oriented formatting facilities; the
latter two suffer from private buffering, localization, and conversion
services (such services must be integrated or synchronized at a lower
level in order for multiple formatting facilities to coexist
peacefully).  It is clear to me how the proposed interface can be
extended to support wchar_t, char8_t, char16_t, and char32_t in the
future if output is written directly to a C (or POSIX) stream, but it is
not at all clear to me how that could be done correctly for wide C++
streams; char has won when it comes to I/O interfaces and there is no
expectation of that changing any time soon.

The paper notes that P1885 <https://wg21.link/p1885> would provide an
improvement over the is_utf8() method of encoding determination.  I
agree, but there are multiple ways in which it could be used to provide
improvements and I'm not sure which capabilities Victor has in mind (we
haven't discussed this in SG16 yet).  It could be used as a simple
replacement for the is_utf8() implementation.  For example:

    constexpr bool is_utf8() {
        return text_encoding::literal() ==
               text_encoding(text_encoding::id::UTF8);
    }

However, P1885 could also be used to detect the system (run-time)
encoding such that output could then be transcoded to match.  This is
the possibility I alluded to above about this being an opportunity to
get character encoding right.
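
As a rough illustration of what querying the system (run-time) encoding
looks like today, here is a minimal sketch that uses the POSIX
nl_langinfo(CODESET) interface as a stand-in for what P1885 would
provide.  This is not the P1885 interface; the helper name
runtime_encoding_name() is hypothetical, and the string returned is
platform specific (it is not a normalized encoding identifier).

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX: nl_langinfo(), CODESET
#include <string>

// Hypothetical helper (not part of P1885 or P2093): returns the name of
// the character encoding associated with the user's locale.
// Note: setlocale() changes global state, so a real program would
// normally do this once at startup.
std::string runtime_encoding_name() {
    // Adopt the LC_CTYPE locale from the environment (LANG, LC_ALL, ...).
    // If this fails, the current locale is left unchanged.
    std::setlocale(LC_CTYPE, "");
    // Ask the C library which codeset that locale uses.
    return nl_langinfo(CODESET);
}
```

A program could compare this name against its internal encoding to
decide whether transcoding is needed on output.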

Consider the following program invocations in a Russian Windows
environment with a default code page of Windows-1251 where greet is a
C++ program compiled so that the execution encoding is UTF-8 (e.g., via
the Visual C++ /utf-8 option) and where it writes the Russian greeting
(to a Greek friend) example from the paper, std::print("Привет, κόσμος!").

    # An invocation that writes to the console.
    > greet                  #1

    # An invocation that writes to a file.
    > greet > file.txt       #2

    # An invocation that writes to a pipe.
    > greet | grep "Привет"  #3

The desired behavior for #1 is clear; regardless of the system
(run-time) encoding, the goal is for the console to display the intended
characters.  The execution encoding is known.  If the console encoding
is also known or can be specified, then getting this right is a
straightforward case of transcoding to the desired encoding.

The desired behavior for #2 is less clear.  The reasonable encoding
options are UTF-8 and Windows-1251, but the latter will not be able to
represent the full output accurately as some of the Greek characters
are not available in Windows-1251 and will therefore be substituted in
some way.  But if UTF-8 is produced and the
next program that reads the file consumes it as Windows-1251, then the
accuracy provided by UTF-8 won't matter anyway.  Only the user is in a
position of knowing what the desired outcome is.
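
To make the substitution problem concrete, here is a minimal sketch of a
UTF-8 to Windows-1251 conversion.  It is an illustration under stated
assumptions, not a complete transcoder: only ASCII and the basic
Cyrillic letters are mapped, every other code point is replaced with
'?', and UTF-8 validation is minimal (overlong forms are not rejected).
The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode one code point starting at s[i] and
// advance i past it.  Returns U+FFFD for malformed or truncated input.
char32_t decode_utf8(std::string_view s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // ASCII
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD;       // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Hypothetical helper: convert UTF-8 text to Windows-1251, substituting
// '?' for anything this sketch does not map.
std::string utf8_to_windows1251(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        const char32_t cp = decode_utf8(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);                      // ASCII
        } else if (cp >= 0x0410 && cp <= 0x044F) {
            out += static_cast<char>(0xC0 + (cp - 0x0410));    // U+0410..U+044F
        } else if (cp == 0x0401) {
            out += static_cast<char>(0xA8);  // CYRILLIC CAPITAL LETTER IO
        } else if (cp == 0x0451) {
            out += static_cast<char>(0xB8);  // CYRILLIC SMALL LETTER IO
        } else {
            out += '?';                      // not representable here
        }
    }
    return out;
}
```

With the greeting from the paper as input, the Cyrillic word survives
the conversion while each of the six Greek letters comes out as a
substitution character.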

The desired behavior for #3 is clearer.  For grep to work as
intended, the encoding of the input and the pattern must match, or both
must be converted to a common encoding.  In the absence of explicit direction,
grep must assume that the input and pattern are both encoded as
Windows-1251.  Assuming UTF-8 is not an option because the command line
used for the invocation (that contains the pattern) is Windows-1251 encoded.

Reliable and standard interfaces exist to determine when a stream is
directed to a terminal/console; POSIX specifies isatty()
<https://pubs.opengroup.org/onlinepubs/009695399/functions/isatty.html>
as noted in the paper.  It is generally possible to determine if a
stream corresponds to a file or a pipe as well, but that isn't the
extent of stream types that exist.  There are also sockets, FIFOs, and
other arbitrary character devices.  I believe it is reasonable to
differentiate behavior for a terminal/console, but I think attempting to
differentiate behavior for other kinds of streams would be a recipe for
surprising and difficult to explain behavior.
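
For reference, the terminal check described above can be sketched as
follows.  This is POSIX only, and the wrapper name is_terminal() is
hypothetical; fileno() and isatty() are the standard POSIX interfaces.

```cpp
#include <stdio.h>    // FILE, fileno() (POSIX)
#include <unistd.h>   // isatty() (POSIX)

// Hypothetical helper: true if the C stream is attached to a terminal.
// Files, pipes, sockets, and other devices all report false, so this
// check separates "terminal" from "everything else" and nothing more.
bool is_terminal(FILE* stream) {
    const int fd = fileno(stream);
    return fd != -1 && isatty(fd) == 1;
}
```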

The approach taken in the paper is, if writing to a Unicode-capable
terminal/console and the literal (execution) encoding is UTF-8, then use
native interfaces as necessary to ensure the correct characters appear
on the console; otherwise, just write the characters to the stream.
This suffices to address #1 above (for the specific case of UTF-8), but
it doesn't help with other encodings, nor does it help to improve the
situation for #2 or #3.

The model that I believe produces the fewest surprises and is therefore
the easiest to use reliably is one in which a program uses whatever
internal encoding its programmers prefer (one can think of the
execution/literal encoding as the internal encoding) and then transcodes
to the system/run-time encoding on I/O boundaries.  Thus, a program that
uses UTF-8 as the execution/literal encoding would produce Windows-1251
output in the non-terminal/console scenarios described above.  This
differs from the model described in the paper (UTF-8 output would be
produced in that model).

I expect some people reading this to take the position that if it isn't
UTF-8 then it is wrong.  My response is that mojibake is even more
wrong.  The unfortunate reality is that there are several important
ecosystems that are not yet, and may never be, able to migrate to UTF-8
as the system/run-time encoding. Maintaining a clear separation between
internal encoding and external encoding enables correct behavior without
having to recompile programs to choose a different literal/execution
encoding.  As existing ecosystems migrate their system/run-time encoding
to UTF-8, programs written in this way will transparently migrate with them.

To make that last point a bit more concrete, consider the greet | grep
"Привет" example above.  If greet produces UTF-8 in a Windows-1251
environment, then a user will have to explicitly deal with the encoding
differences, perhaps by inserting a conversion operation as in greet |
iconv -f utf-8 -t windows-1251 | grep "Привет".  The problem with this
is, if the system/run-time encoding changes in the future, then the
explicit conversion will introduce mojibake.  Thus, such workarounds
become an impediment to UTF-8 migration.

The behavior I want to see adopted for std::print() is:

 1. When writing directly to a terminal/console, exploit native
    interfaces as necessary for text to be displayed correctly.
 2. Otherwise, write output encoded to match the system/run-time
    encoding; the encoding that P1885 indicates via text_encoding::system().

Some other nit-picky items:

  * Section 6, "Unicode" states that the vprint_unicode() function is
    exposition-only, but it and vprint_nonunicode() are both present in
    the proposed wording with no indication of being exposition-only.
  * Section 6, "Unicode" discusses use of the Visual C++ /utf-8 option. 
    This section is incorrect in stating that both the source and
    literal (execution) encoding must be UTF-8 for is_utf8() to return
    true; the source encoding is not relevant.  Only the
    /execution-charset:utf-8 option is needed (the /utf-8 option implies
    both /source-charset:utf-8 and /execution-charset:utf-8).
  * It may be worth noting in the paper that overloads could
    additionally be provided to support writing directly to POSIX file
    descriptors as is done by the POSIX dprintf() interface
    <https://pubs.opengroup.org/onlinepubs/9699919799/functions/dprintf.html>
    (which can be quite useful to write directly to a socket, pipe, or
    file).
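
As a rough idea of what the descriptor-writing half of such an overload
involves, here is a minimal POSIX sketch.  It covers only the output
step for text that has already been formatted; print_fd() is a
hypothetical name, not an interface proposed in the paper.

```cpp
#include <cerrno>
#include <cstddef>
#include <string_view>
#include <unistd.h>   // write(), ssize_t (POSIX)

// Hypothetical helper: write all of text to a POSIX file descriptor,
// retrying after partial writes and EINTR.  Returns false on error.
bool print_fd(int fd, std::string_view text) {
    while (!text.empty()) {
        const ssize_t n = ::write(fd, text.data(), text.size());
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, try again
            return false;                  // real error (e.g. EBADF)
        }
        // Drop the bytes that were written and loop for the rest.
        text.remove_prefix(static_cast<std::size_t>(n));
    }
    return true;
}
```

Unlike a FILE*, a descriptor has no user-space buffer, so the loop is
needed to handle short writes on sockets and pipes.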

Tom.


