Hello,
Thanks for the mail, Tom.

Here are my thoughts on the subject.
We (the C++ committee) do not get to dictate the environment we run in.

The only thing we can decide is whether to support running in some environments. And maybe the amount of grace with which we crash.

I think your list of things that produce strings is helpful. Let's go over it.

Command line arguments and Environment variables

These things have different natures on different platforms.
Bytes on POSIX; UTF-16 on Windows (or WTF-16, I'm not sure).

These things come from a parent process, which will have called execve, CreateProcess, or a similar API.
On POSIX specifically, there is no real expectation that these things are text at all, with the caveat that '=' is used to separate an environment variable's name from its value.

We can't (shouldn't) ask these platforms to provide APIs to feed UTF-8 data to C++ programs.
But in any case, these are external data and should not be trusted.

We could have a function:

int main(int argc, char8_t** argv, char8_t** envp);

And honestly, I wish we had. I think I made a ticket somewhere.
But, beyond the fact that we'd need to involve a lot of committees and implementers to get that done, its semantics are interesting to think about: what happens when the parameters are not clean UTF-8?

- The program never starts (execve fails)
- The program terminates immediately
- main is run anyway, with or without being UB. This forces users to verify that they have valid UTF-8, and at that point, what have we gained (compared to asking users to decode the data themselves)?

Locale supplied text

This is a non-problem. A locale is a function that takes a string and returns a string. That locales and encodings are conflated under our current model does not constitute a fundamental limitation.
We just need a better locale library (the challenge there is the sheer amount of work required to make that happen).

And we should separate fundamental issues that stem from the environment not being UTF-8 from C++-specific issues related to the set of facilities in <locale>, which we are definitely in a position to fix (by providing alternatives that do not conflate encoding and locale).

The default devices associated with stdin, stdout, and stderr (e.g., the terminal/console encoding assuming no redirection of the streams).

This again, is not under our control.
A parent process feeds a data stream that we read from, and we write data to a stream read by some process.
The best we can do is document the pre- and postconditions of a given C++ program.
By convention, UNIX programs should abide by LC_*.
Some projects, like Qt, have pre- and postconditions that they receive and produce UTF-8 data, and if the child/parent process can't cope... what happens happens.

Windows here is interesting, because it has special files that can be bound to stdin/out/err: Consoles.
Consoles support Unicode, either by means of WriteConsoleW/ReadConsoleW or through the regular read/write functions. In the latter case, the console performs conversions to/from Unicode using a console-specific narrow encoding that probably defaults to something archaic (we should test that!).

I don't see that Windows having special text-aware devices for consoles is something C++ should require of all implementations.
I don't know if this specific Windows architecture says anything about the text model for C++.

In particular, a lot of engineering is necessary to make a Unicode console (text rendering, selection, etc.), and I don't know whether that's something IBM has an interest in working on.
That IBM can render some ASCII characters does not necessarily mean Unicode support is possible on their systems.

But it is true that on Windows, calling SetConsoleCP(CP_UTF8) and SetConsoleOutputCP(CP_UTF8) would solve the Windows problem.

File names

File names are stored in file systems, and we can't tell everyone to please use a file system that uses Unicode.
I think there are a few issues there.
It is always fun to think about how these things interact (can you open a path whose name comes from a command line argument?). I wish we were a bit more pragmatic about these things. That there exist paths with embedded nulls, for example, doesn't mean the standard should sacrifice the common use case to make them work.

Text file contents

Strictly a contract between the applications that handle a given file.
Most files (most I/O, really) cannot be trusted, which implies the existence of verification/decoding functions.
But this is not part of the environment.

Function and file names encoded in the __FILE__ macro, the __func__ predefined variable, and in std::source_location objects.

These have no specified encoding, which... is probably a mistake. I don't remember why; there are probably minutes. Either way, it is easily addressed: they should behave like ordinary string literals.

Character and string literals.
This applies to both ordinary and wide literals, actually

I saved the most interesting one for the end.
And by interesting, I mean contentious (apparently).

What C calls character functions have an expectation of text encoding,
and so passing _any_ string to these functions produces funny results.

I am being very careful not to use the terms "precondition" and "UB" here, as
we still don't have consensus that text errors are errors (*sigh*).

String literals are but a special case of strings that may be in a different encoding from the one expected.
The one thing that is special about string literals is that they may be interpreted at compile time, and that should not be observable (this is mitigated by the fact that there are no standard constexpr text handling functions as of today).

Beyond that string literals don't really exist.
Any piece of data is a memcpy away from losing information about its provenance.

const char* a = "こんにちは";
print(a); //1
std::string b{a};
print(b); //2

1 & 2 better produce the same output!

So, in practice, there is an assumption that string literals interpreted at runtime preserve their semantics. From there, passing a string literal at runtime to any function that assumes a different encoding is no muy bueno.

This implies that whatever encoding these functions use must be a superset of the literal encoding.

Uncomfortably, we observe that this is not always the case in practice.
Would it be that all C++ programs are sometimes no muy bueno?

And I think your idea is an attempt at fixing that.

I think we should recognize the current situation first.

How do string literals relate to encoding-dependent character/locale-specific functions?
What happens when setlocale is called and changes the encoding?

When we confront these scenarios and come to the inevitable conclusion that they cannot possibly work, we will be left with a few options.
As such, I think trying to define "a UTF-8 environment" doesn't give us much as long as we also have to specify the behavior of other environments.

Especially as it is not yet clear to me what you propose.
And most of your observations relate to Windows. The question, then, is: what is Microsoft willing to do?
There are other platforms that have mismatch between literal encodings and what is used by character functions at runtime. What do we do there?
Are these implementers interested in improving the situation?

Are we willing to, for example, decode inputs (args, env, stdin, etc.) to UTF-8 (from the encoding associated with LC_ALL)? For input, that might be the most reasonable solution, rather than trying to do things the other way around (but first we need something along the lines of P1629).

There is a lot of teaching to be done, too.
The standard should specify these things better, though; that would go a long way.

And as Victor demonstrated, output is one thing we do have control over, so that's actually a non-issue on most systems. But on systems where it is an issue, transcoding *from* Unicode is not an option, so... tough luck?

In the end, that is what I would recommend we do.
Corentin

On Wed, Jul 28, 2021 at 8:31 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

P2093R7 and earlier revisions rely on the choice of UTF-8 for the literal encoding as a proxy indication that the program will run in a Unicode environment with the intent of basing the behavior of the proposed std::print() function on that choice.

SG16 discussed this and other concerns at length during its May 12th, May 26th, June 9th, and June 23rd 2021 telecons.  Consensus on this matter remains weak.  At least some of the concerns raised about basing behavior on the choice of literal encoding include:

  1. The choice of literal encoding has historically had no effect on the encodings used at run-time.  For example, encoding sensitive functions like mbstowcs() and mbrtowc() do not alter their behavior based on the choice of literal encoding, nor is the encoding used for locale provided text based on it.
  2. The proposed design does not treat all encodings equally; UTF-8 is treated differently than other commonly used encodings like Windows-1252.
  3. The literal encoding may differ across translation units.
  4. Given the following program fragment that contains only ASCII characters in string literals, its behavior would differ if the literal encoding is UTF-8 vs some other ASCII-based encoding regardless of whether that choice affects the data produced by get_some_text().
    std::print("{}", get_some_text());

These concerns and the lack of good consensus has prompted me to look for alternative design possibilities that may lead to a solution with stronger consensus.  This post explores one possibility.

SG16 recently approved P2295R5 and its proposed requirement for an implementation-defined mechanism to specify that source files are UTF-8 encoded.  This approach reflects existing practice in Microsoft Visual C++ via its /source-charset:utf-8 option, GCC via its -finput-charset=utf-8 option, and Clang's default behavior.  Perhaps we can likewise require an implementation-defined mechanism to specify that a program be run in a UTF-8 environment.

What constitutes a UTF-8 environment for a C++ program?  I think of an ideal UTF-8 environment as one where the following are all (ostensibly) UTF-8 encoded:

  1. Ordinary character and string literals.
  2. Function and file names encoded in the __FILE__ macro, the __func__ predefined variable, and in std::source_location objects.
  3. Command line arguments.
  4. Environment variable names.
  5. Environment variable values.
  6. Locale supplied text.
  7. The default devices associated with stdin, stdout, and stderr (e.g., the terminal/console encoding assuming no redirection of the streams).
  8. File names.
  9. Text file contents.

In practice, no implementation is in a position to guarantee well-formed UTF-8 for all of the above.  That suggests that there isn't a single notion of a portable UTF-8 environment, but rather a spectrum.  For example, file names may typically be UTF-8 encoded, but not enforced; different text files may be differently encoded; environment variables may hold binary data.  That is all ok; the goal is to establish expectations, not obviate the need for error handling or special cases.

If the standard were to define a UTF-8 environment, then each of the above could be considered conformance rules for which an implementation could document their conformance; similarly to what we recently did for P1949 and conformance with UAX #31.

Taking this back to P2093.  With a specification for a UTF-8 environment and an implementation-defined mechanism to opt-in to it, the special behavior we've been discussing for std::print() could be tied to it instead of to the choice of literal encoding.

However, I think we can do better.

Corentin created the github repo at https://github.com/cor3ntin/utf8-windows-demo to demonstrate how to build a program that uses UTF-8 at run-time and that can successfully write UTF-8 encoded text to the Windows console without having to use the stream bypass technique documented in P2093R7.  Instead of bypassing the stream, it explicitly sets the encoding of the console and uses an application manifest to run the program with the Active Code Page (ACP) set to UTF-8.  The latter has the effect that command line options, environment variable names and values, locale supplied text, and file names will all be provided in UTF-8.  Combined with the Visual C++ /execution-charset:utf-8 option, a program built in this way will run in an environment that closely matches the UTF-8 environment I described above.

It turns out that the ability to build a C++ program that runs in something like a UTF-8 environment already matches existing practice for common platforms:

  • On Windows:
    • As Corentin's work demonstrates, programs on Windows can force the ACP to UTF-8 by linking with an appropriate manifest file; this opts a program into using UTF-8 for command line options, environment variables, locale supplied text, and file names.
    • The console/terminal encoding can be set to UTF-8 by calling SetConsoleCP() and SetConsoleOutputCP().
    • The literal encoding can be set to UTF-8 by compiling with Visual C++ and the /execution-charset:utf-8 option.
  • On Linux/UNIX:
    • Running in a UTF-8 environment is already standard practice.
  • On z/OS:
    • IBM supports targeting an "enhanced ASCII" run-time environment that implicitly converts between ASCII and EBCDIC.  Though ASCII is the only encoding supported at present, this feature could potentially provide a basis for supporting a UTF-8 environment in the future.

The existing opt-in mechanisms are less than ideal; particularly the need for explicit function calls on Windows to set the console encoding.  It may be that implementors would be willing to make improvements.

There are a number of details that would need to be worked out.  Some examples:

  • On POSIX systems, what would it mean to run a program built to target a UTF-8 environment in an environment with LC_ALL set, e.g., zh_HK.big5hkscs?  Should that be UB?  Should the .big5hkscs property be ignored?  Should we specify that the implementation implicitly transcode?
  • On POSIX systems, localedef can be used to define a locale with its own character set and character classifications.  Can implementations reasonably reason about the encoding of such locales?

Comments and questions would be appreciated.  Is this a direction worth pursuing?

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16