sg16: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 28 Jul 2021 02:30:41 -0400

P2093R7 <https://wg21.link/p2093r7> and earlier revisions rely on the
choice of UTF-8 for the literal encoding as a proxy indication that the
program will run in a Unicode environment with the intent of basing the
behavior of the proposed std::print() function on that choice.
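
For readers unfamiliar with how a library can observe the literal
encoding at all, here is a minimal sketch of one compile-time probe. It
is not taken from P2093R7; the names encodes_u00e9_as_utf8 and
literal_encoding_looks_utf8 are hypothetical. It assumes that the
compiler encodes U+00E9 in an ordinary string literal as the two bytes
0xC3 0xA9 only when the literal encoding is UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: true if `lit` holds exactly the UTF-8 encoding of
// U+00E9 (0xC3 0xA9) followed by the terminating null character.
template <std::size_t N>
constexpr bool encodes_u00e9_as_utf8(const char (&lit)[N]) {
    return N == 3 &&
           static_cast<unsigned char>(lit[0]) == 0xC3 &&
           static_cast<unsigned char>(lit[1]) == 0xA9;
}

// Probe the ordinary literal encoding chosen for this translation unit.
// The result can differ between translation units built with different
// options, which is one of the concerns with using it as a proxy.
constexpr bool literal_encoding_looks_utf8 = encodes_u00e9_as_utf8("\u00E9");
```

Note that the probe says nothing about the encoding of data that arrives
at run-time; it only reflects a compiler option.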

SG16 discussed this and other concerns at length during its May 12th
<https://github.com/sg16-unicode/sg16-meetings#may-12th-2021>, May 26th
<https://github.com/sg16-unicode/sg16-meetings#may-26th-2021>, June 9th
<https://github.com/sg16-unicode/sg16-meetings#june-9th-2021>, and June
23rd <https://github.com/sg16-unicode/sg16-meetings#june-23rd-2021> 2021
telecons. Consensus on this matter remains weak. The concerns raised
about basing behavior on the choice of literal encoding include at
least the following:

1. The choice of literal encoding has historically had no effect on the
    encodings used at run-time. For example, encoding sensitive
    functions like mbstowcs() and mbrtowc() do not alter their behavior
    based on the choice of literal encoding, nor is the encoding used
    for locale provided text based on it.
2. The proposed design does not treat all encodings equally; UTF-8 is
    treated differently than other commonly used encodings like
    Windows-1252.
3. The literal encoding may differ across translation units.
4. Given the following program fragment that contains only ASCII
    characters in string literals, its behavior would differ if the
    literal encoding is UTF-8 vs some other ASCII-based encoding
    regardless of whether that choice affects the data produced by
    get_some_text().
    std::print("{}", get_some_text());
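
To illustrate the first concern, the following is a minimal sketch (the
helper name count_decoded is hypothetical) that decodes a byte string
with mbrtowc(). The decoding is governed by the LC_CTYPE category of the
current locale, as set by setlocale(), and not by the compiler option
that selects the literal encoding.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Counts the wide characters that mbrtowc() decodes from `text` under the
// current LC_CTYPE locale. Returns -1 if a sequence is invalid or
// incomplete in that locale's encoding.
long count_decoded(const char* text) {
    std::mbstate_t state{};
    const char* p = text;
    const char* end = text + std::strlen(text);
    long count = 0;
    while (p < end) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            return -1;  // invalid or truncated sequence for this locale
        }
        if (n == 0) {
            break;  // decoded a null character
        }
        p += n;
        ++count;
    }
    return count;
}
```

Calling count_decoded() on the same non-ASCII bytes before and after
std::setlocale(LC_ALL, "") may give different results depending on the
user's locale, even though the program and its literal encoding are
unchanged.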

These concerns and the lack of good consensus have prompted me to look
for alternative design possibilities that may lead to a solution with
stronger consensus. This post explores one possibility.

SG16 recently approved P2295R5 <https://wg21.link/p2295r5> and its
proposed requirement for an implementation-defined mechanism to specify
that source files are UTF-8 encoded. This approach reflects existing
practice in Microsoft Visual C++ via its /source-charset:utf-8 option,
GCC via its -finput-charset=utf-8 option, and Clang's default behavior.
Perhaps we can likewise require an implementation-defined mechanism to
specify that a program be run in a UTF-8 environment.

What constitutes a UTF-8 environment for a C++ program? I think of an
ideal UTF-8 environment as one where the following are all (ostensibly)
UTF-8 encoded:

1. Ordinary character and string literals.
2. Function and file names encoded in the __FILE__ macro, the __func__
    predefined variable, and in std::source_location objects.
3. Command line arguments.
4. Environment variable names.
5. Environment variable values.
6. Locale supplied text.
7. The default devices associated with stdin, stdout, and stderr (e.g.,
    the terminal/console encoding assuming no redirection of the streams).
8. File names.
9. Text file contents.

In practice, no implementation is in a position to guarantee well-formed
UTF-8 for all of the above. That suggests that there isn't a single
notion of a portable UTF-8 environment, but rather a spectrum. For
example, file names may typically be UTF-8 encoded, but not enforced;
different text files may be differently encoded; environment variables
may hold binary data. That is all ok; the goal is to establish
expectations, not obviate the need for error handling or special cases.
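
As an illustration of the kind of error handling that would still be
needed, here is a minimal sketch of a well-formedness check that a
program could apply to an environment variable value or a file name
before treating it as UTF-8. The function name is_well_formed_utf8 is
hypothetical; the byte ranges follow the well-formed UTF-8 byte sequence
table in the Unicode Standard (no overlong forms, no surrogates, nothing
above U+10FFFF).

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `s` consists entirely of well-formed UTF-8 sequences.
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) {  // ASCII
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (s.size() - i < len) return false;  // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

A program might, for example, pass the result of std::getenv() through
this check and fall back to treating the value as opaque bytes when it
fails.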

If the standard were to define a UTF-8 environment, then each of the
above could be treated as a conformance rule against which an
implementation documents its conformance, similar to what we recently
did for P1949 <https://wg21.link/p1949> and conformance with UAX #31
<http://www.unicode.org/reports/tr31>.

Taking this back to P2093 <https://wg21.link/p2093>. With a
specification for a UTF-8 environment and an implementation-defined
mechanism to opt-in to it, the special behavior we've been discussing
for std::print() could be tied to it instead of to the choice of literal
encoding.

However, I think we can do better.

Corentin created the github repo at
https://github.com/cor3ntin/utf8-windows-demo
<https://github.com/cor3ntin/utf8-windows-demo> to demonstrate how to
build a program that uses UTF-8 at run-time and that can successfully
write UTF-8 encoded text to the Windows console without having to use
the stream bypass technique documented in P2093R7
<https://wg21.link/p2093r7>. Instead of bypassing the stream, it
explicitly sets the encoding of the console and uses an application
manifest to run the program with the Active Code Page (ACP) set to
UTF-8. The latter has the effect that command line options, environment
variable names and values, locale supplied text, and file names will all
be provided in UTF-8. Combined with the Visual C++
/execution-charset:utf-8 option, a program built in this way will run in
an environment that closely matches the UTF-8 environment I described above.

It turns out that the ability to build a C++ program that runs in
something like a UTF-8 environment already matches existing practice for
common platforms:

  * On Windows:
      o As Corentin's work demonstrates, programs on Windows can force
        the ACP to UTF-8 by linking with an appropriate manifest file;
        this opts a program into using UTF-8 for command line options,
        environment variables, locale supplied text, and file names.
      o The console/terminal encoding can be set to UTF-8 by calling
        SetConsoleCP() and SetConsoleOutputCP().
      o The literal encoding can be set to UTF-8 by compiling with
        Visual C++ and the /execution-charset:utf-8 option.
  * On Linux/UNIX:
      o Running in a UTF-8 environment is already standard practice.
  * On z/OS:
      o IBM supports targeting an "enhanced ASCII" run-time environment
        that implicitly converts between ASCII and EBCDIC. Though ASCII
        is the only encoding supported at present, this feature could
        potentially provide a basis for supporting a UTF-8 environment
        in the future.
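
The explicit function calls mentioned for Windows can be sketched as
follows. This is a minimal illustration, not code from Corentin's
repository; the wrapper name request_utf8_console is hypothetical, and
the non-Windows branch simply assumes there is nothing to do.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Asks the Windows console to use UTF-8 for both input and output.
// Returns true on success. On other platforms this is a no-op that
// returns true (an assumption of this sketch).
bool request_utf8_console() {
#ifdef _WIN32
    // CP_UTF8 is the code page identifier for UTF-8 (65001).
    const bool input_ok = SetConsoleCP(CP_UTF8) != 0;
    const bool output_ok = SetConsoleOutputCP(CP_UTF8) != 0;
    return input_ok && output_ok;
#else
    return true;
#endif
}
```

A program would call this once, early in main(), before writing to
stdout. The calls affect only the console; the ACP still has to be set
separately via the manifest, and the literal encoding via the compiler
option.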

The existing opt-in mechanisms are less than ideal, particularly the
need for explicit function calls on Windows to set the console
encoding. It may be that implementors would be willing to make
improvements.

There are a number of details that would need to be worked out. Some
examples:

  * On POSIX systems, what would it mean to run a program built to
    target a UTF-8 environment in an environment with LC_ALL set to,
    e.g., zh_HK.big5hkscs? Should that be UB? Should the .big5hkscs
    property be ignored? Should we specify that the implementation
    implicitly transcode?
  * On POSIX systems, localedef can be used to define a locale with its
    own character set and character classifications. Can
    implementations reasonably determine the encoding of such locales?
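
These questions remain open, but a program can at least observe the
mismatch today. The following is a minimal sketch, assuming a POSIX
system that provides nl_langinfo(); the helper names names_utf8 and
current_codeset are hypothetical, and the name normalization is a
simplification since codeset names are not standardized across systems.

```cpp
#include <cctype>
#include <clocale>
#include <string>
#include <string_view>

#if defined(__unix__) || defined(__APPLE__)
#include <langinfo.h>
#define SKETCH_HAVE_NL_LANGINFO 1
#endif

// True if `codeset` names UTF-8, ignoring case and punctuation
// (so "UTF-8", "utf8", and "Utf_8" all match).
bool names_utf8(std::string_view codeset) {
    std::string normalized;
    for (char c : codeset) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc)) {
            normalized += static_cast<char>(std::tolower(uc));
        }
    }
    return normalized == "utf8";
}

// Returns the codeset name of the current LC_CTYPE locale, or an empty
// string where nl_langinfo() is not available.
std::string current_codeset() {
#ifdef SKETCH_HAVE_NL_LANGINFO
    const char* name = nl_langinfo(CODESET);
    return name ? name : "";
#else
    return {};
#endif
}
```

After std::setlocale(LC_ALL, ""), a program run with LC_ALL set to
zh_HK.big5hkscs would be expected to see a non-UTF-8 codeset name from
current_codeset(), at which point it has to choose among the options
listed above. For a locale produced by localedef with a custom character
set, the reported name may not be one that the program recognizes at all.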

Comments and questions would be appreciated. Is this a direction worth
pursuing?

Tom.

Received on 2021-07-28 01:30:46