P2093R7 and earlier revisions treat the choice of UTF-8 as the literal encoding as a proxy indication that the program will run in a Unicode environment, and base the behavior of the proposed std::print() function on that choice.
SG16 discussed this and other concerns at length during its May 12th, May 26th, June 9th, and June 23rd 2021 telecons. Consensus on this matter remains weak. The concerns raised about basing behavior on the choice of literal encoding include at least the following:
- The choice of literal encoding has historically had no effect on the encodings used at run-time. For example, encoding-sensitive functions like mbstowcs() and mbrtowc() do not alter their behavior based on the choice of literal encoding, nor is the encoding used for locale-provided text based on it.
- The proposed design does not treat all encodings equally; UTF-8 is treated differently than other commonly used encodings like Windows-1252.
- The literal encoding may differ across translation units.
- The behavior of the following program fragment, which contains only ASCII characters in its string literals, would differ depending on whether the literal encoding is UTF-8 or some other ASCII-based encoding, regardless of whether that choice affects the data produced by get_some_text():
std::print("{}", get_some_text());
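To make that last concern concrete, here is a minimal sketch of the kind of compile-time check that ties behavior to the literal encoding. The names literal_encoding_is_utf8(), OutputPath, and chosen_output_path() are hypothetical and exist only for illustration; the sketch also assumes that U+00B5 is representable in the literal encoding in use.

```cpp
// Hypothetical helper, for illustration only: detects whether the ordinary
// literal encoding is UTF-8 by inspecting how one non-ASCII character is encoded.
constexpr bool literal_encoding_is_utf8() {
    // U+00B5 MICRO SIGN is the two bytes C2 B5 in UTF-8, but the single
    // byte B5 in Windows-1252. sizeof counts the terminating NUL as well.
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Hypothetical stand-in for the two output strategies a print function
// could choose between.
enum class OutputPath { unicode_aware, byte_transparent };

// The decision is fixed when this translation unit is compiled. It does not
// depend on the characters that appear in any particular format string, nor
// on the encoding of data produced at run-time.
constexpr OutputPath chosen_output_path() {
    return literal_encoding_is_utf8() ? OutputPath::unicode_aware
                                      : OutputPath::byte_transparent;
}
```

Compiling the same source with a different literal encoding option flips the result of chosen_output_path() even though a format string such as "{}" has identical bytes under both encodings.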
These concerns and the lack of good consensus have prompted me to look for alternative design possibilities that may lead to a solution with stronger consensus. This post explores one possibility.
SG16 recently approved P2295R5 and its proposed requirement for an implementation-defined mechanism to specify that source files are UTF-8 encoded. This approach reflects existing practice in Microsoft Visual C++ via its /source-charset:utf-8 option, GCC via its -finput-charset=utf-8 option, and Clang's default behavior. Perhaps we can likewise require an implementation-defined mechanism to specify that a program be run in a UTF-8 environment.
What constitutes a UTF-8 environment for a C++ program? I think of an ideal UTF-8 environment as one where the following are all (ostensibly) UTF-8 encoded:
- Ordinary character and string literals.
- Function and file names encoded in the __FILE__ macro, the __func__ predefined variable, and in std::source_location objects.
- Command line arguments.
- Environment variable names.
- Environment variable values.
- Locale supplied text.
- The default devices associated with stdin, stdout, and stderr (e.g., the terminal/console encoding assuming no redirection of the streams).
- File names.
- Text file contents.
In practice, no implementation is in a position to guarantee well-formed UTF-8 for all of the above. That suggests that there isn't a single notion of a portable UTF-8 environment, but rather a spectrum. For example, file names may typically be UTF-8 encoded without that being enforced; different text files may be differently encoded; environment variables may hold binary data. That is all OK; the goal is to establish expectations, not to obviate the need for error handling or special cases.
If the standard were to define a UTF-8 environment, then each of the above could be treated as a conformance rule against which an implementation documents its conformance, similar to what we recently did in P1949 for conformance with UAX #31.
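Since none of the sources listed above is guaranteed to hold well-formed UTF-8, a program targeting such an environment would still need to validate text it receives. The following is a minimal sketch of such a check, following the well-formed byte sequence ranges in the Unicode Standard. The function name is_well_formed_utf8() is hypothetical, and the sketch assumes C++17 for std::string_view.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, for illustration only: returns true if s consists
// entirely of well-formed UTF-8 code unit sequences.
bool is_well_formed_utf8(std::string_view s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }  // ASCII

        std::size_t len = 0;                // total bytes in this sequence
        unsigned char lo = 0x80, hi = 0xBF; // allowed range for the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // no overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; if (b0 == 0xED) hi = 0x9F; }  // no surrogates
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // no overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF4) { len = 4; if (b0 == 0xF4) hi = 0x8F; }  // <= U+10FFFF
        else return false;                  // 80..C1 and F5..FF never start a sequence

        if (n - i < len) return false;      // truncated sequence
        const auto b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

A program could apply such a check to, for example, a command line argument or the result of std::getenv() before treating it as UTF-8 text, and fall back to byte-oriented handling when the check fails.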
Taking this back to P2093: with a specification for a UTF-8 environment and an implementation-defined mechanism to opt in to it, the special behavior we've been discussing for std::print() could be tied to that environment instead of to the choice of literal encoding.
However, I think we can do better.
Corentin created the GitHub repo at https://github.com/cor3ntin/utf8-windows-demo to demonstrate how to build a program that uses UTF-8 at run-time and that can successfully write UTF-8 encoded text to the Windows console without having to use the stream bypass technique documented in P2093R7. Instead of bypassing the stream, it explicitly sets the encoding of the console and uses an application manifest to run the program with the Active Code Page (ACP) set to UTF-8. The latter has the effect that command line options, environment variable names and values, locale supplied text, and file names will all be provided in UTF-8. Combined with the Visual C++ /execution-charset:utf-8 option, a program built in this way will run in an environment that closely matches the UTF-8 environment I described above.
It turns out that the ability to build a C++ program that runs in something like a UTF-8 environment already matches existing practice for common platforms:
- On Windows:
  - As Corentin's work demonstrates, programs on Windows can force the ACP to UTF-8 by linking with an appropriate manifest file; this opts a program into using UTF-8 for command line options, environment variables, locale supplied text, and file names.
  - The console/terminal encoding can be set to UTF-8 by calling SetConsoleCP() and SetConsoleOutputCP().
  - The literal encoding can be set to UTF-8 by compiling with Visual C++ and the /execution-charset:utf-8 option.
- On Linux/UNIX:
  - Running in a UTF-8 environment is already standard practice.
- On z/OS:
  - IBM supports targeting an "enhanced ASCII" run-time environment that implicitly converts between ASCII and EBCDIC. Though ASCII is the only encoding supported at present, this feature could potentially provide a basis for supporting a UTF-8 environment in the future.
The existing opt-in mechanisms are less than ideal, particularly the need for explicit function calls on Windows to set the console encoding. It may be that implementors would be willing to make improvements.
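For reference, the explicit calls needed on Windows today look roughly like the following sketch. The helper names acp_is_utf8() and request_utf8_console() are hypothetical. The sketch assumes the program is attached to a console; the calls can fail otherwise, for example when no console is present. On other platforms the helpers do nothing and report success, on the assumption that the terminal is already UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper, for illustration only: reports whether the Active
// Code Page is UTF-8, e.g. because the program was linked with a suitable
// application manifest.
bool acp_is_utf8() {
#ifdef _WIN32
    return GetACP() == CP_UTF8;
#else
    return true;  // Assumption: nothing to check outside Windows.
#endif
}

// Hypothetical helper, for illustration only: asks the console to use UTF-8
// for both input and output. Returns false if either request fails.
bool request_utf8_console() {
#ifdef _WIN32
    return SetConsoleCP(CP_UTF8) != 0 && SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // Assumption: the terminal is already UTF-8.
#endif
}
```

A program would call request_utf8_console() early in main(), before writing any text. An implementation-provided opt-in mechanism could make such calls unnecessary.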
There are a number of details that would need to be worked out. Some examples:
- On POSIX systems, what would it mean to run a program built to target a UTF-8 environment in an environment with LC_ALL set to, e.g., zh_HK.big5hkscs? Should that be UB? Should the .big5hkscs property be ignored? Should we specify that the implementation implicitly transcode?
- On POSIX systems, localedef can be used to define a locale with its own character set and character classifications. Can implementations reasonably determine the encoding of such locales?
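As a starting point for the POSIX questions, a program can at least detect a mismatch by asking the current locale for its codeset name. The sketch below is POSIX-only and uses nl_langinfo(CODESET); the helper names are hypothetical. Comparing codeset names is a heuristic: the name is just a string chosen by whoever defined the locale, so a locale built with localedef may not be recognized. The sketch does not answer what should happen when a mismatch is found.

```cpp
#include <langinfo.h>  // nl_langinfo, CODESET (POSIX)
#include <strings.h>   // strcasecmp (POSIX)

// Hypothetical helper, for illustration only: true if a codeset name as
// reported by nl_langinfo(CODESET) denotes UTF-8. Spellings vary between
// systems, so the comparison ignores case and accepts "UTF8".
bool is_utf8_codeset_name(const char* name) {
    return name != nullptr &&
           (strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0);
}

// Hypothetical helper, for illustration only: true if the current C locale
// reports a UTF-8 codeset. The caller is expected to have already called
// std::setlocale(LC_ALL, "") to adopt the locale from the environment;
// otherwise the program is still in the "C" locale.
bool current_locale_is_utf8() {
    return is_utf8_codeset_name(nl_langinfo(CODESET));
}
```

With LC_ALL set to zh_HK.big5hkscs and that locale installed, current_locale_is_utf8() would be expected to return false after the call to std::setlocale(), which is the situation the first question above asks about.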
Comments and questions would be appreciated. Is this a direction worth pursuing?
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16