sg16: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 28 Jul 2021 12:11:01 +0200

Hello,
Thanks for the mail Tom.

Here are my thoughts on the subject.
We (the C++ committee) do not get to dictate the environment we run in.

The only thing we can decide is whether to support running in some
environments. And maybe the amount of grace with which we crash.

I think your list of things that produce strings is helpful. Let's go over
it.

*Command line arguments and Environment variables*

These things have different natures on different platforms.
Bytes on POSIX, UTF-16 on Windows (or WTF-16, not sure).

These things come from a parent process, which would have called
execve/CreateProcess or similar APIs.
Specifically on POSIX there is no real expectation that these things are
text at all, with the caveat that '=' will be used to separate environment
variable name/value.

We can't (shouldn't) ask these platforms to provide APIs to feed UTF-8 data
to C++ programs.
But anyway these are external data and should not be trusted.

We could have a function

int main(int argc, char8_t** args, char8_t** env);

And honestly, I wish we had. I think I made a ticket somewhere.
But, beyond the fact that we'd need to involve a lot of committees and
implementers to get that done, its semantics are interesting to think
about: what happens when the parameters are not clean UTF-8?

- The program never starts (execve fails)
- The program terminates immediately
- main is run anyway - with or without being UB. This will force users to
verify that they have valid utf-8 and at this point, what have we gained
(compared to asking users to decode the data themselves)?
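To make the third option concrete, here is a minimal sketch of the check users
would end up writing themselves. The names is_valid_utf8 and
all_args_valid_utf8 are hypothetical, not an existing or proposed standard
API; the check rejects overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;  // expected sequence length
        char32_t cp = 0;      // decoded code point
        char32_t min = 0;     // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Hypothetical helper: the check a program could run on its arguments
// at the top of main.
bool all_args_valid_utf8(int argc, const char* const* argv) {
    for (int i = 0; i < argc; ++i)
        if (!is_valid_utf8(argv[i])) return false;
    return true;
}
```

A program would call all_args_valid_utf8(argc, argv) first thing in main and
decide for itself whether to reject, replace, or pass through bad input.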

*Locale supplied text*

This is a non problem. Locale is a function that takes a string and returns
a string. That under our current model locales and encodings are conflated
does not constitute a fundamental limitation.
We just need a better locale library (The challenge there comes from the
sheer amount of work required to make that happen).

And we should separate fundamental issues that stem from the environment
not being utf-8 from C++ specific issues that are related to the set of
facilities in <locale> that we are definitely in a position to fix (by
providing alternatives that do not conflate encoding and locale).

*The default devices associated with stdin, stdout, and stderr (e.g., the
terminal/console encoding assuming no redirection of the streams).*

This again, is not under our control.
A parent process feeds a data stream that we read from, and we write data
to a stream read by some process.
The best we can do is document the post and preconditions of a given C++
program.
By convention, UNIX programs should abide by LC_*.
Some projects, like Qt, have post-and-preconditions that they get and
produce UTF-8 data, and if the child/parent process can't cope... what
happens happens.

Windows here is interesting, because it has special files that can be bound
to stdin/out/err: Consoles.
Consoles support Unicode, by means of WriteConsoleW/ReadConsoleW or by
regular read/write functions. When doing so, the console will perform
conversions to/from Unicode using a console-specific narrow encoding that
probably defaults to something archaic (we should test that!).

I don't see that Windows having specific text-aware devices for console is
something C++ should require of all implementations.
I don't know if this specific Windows architecture says anything about the
text model for C++.

In particular, a lot of engineering is necessary to make a Unicode console,
for text rendering, selection, etc, and I don't know if that's something
IBM has an interest in working on?
That IBM can render some ascii characters does not necessarily translate to
Unicode support being possible on their system.

But, it is true that on Windows calling SetConsole{Output}CP(CP_UTF8) would
solve the Windows problem.
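A minimal sketch of that call, guarded so it also compiles elsewhere.
SetConsoleCP, SetConsoleOutputCP and GetConsoleOutputCP are the real Win32
functions; the wrapper names enable_utf8_console and console_output_code_page
are hypothetical. The second helper is one way to test what the console
defaults to.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switches the attached console (if any) to UTF-8 for input and output.
// On non-Windows platforms this is a no-op that reports success.
bool enable_utf8_console() {
#ifdef _WIN32
    const bool out_ok = SetConsoleOutputCP(CP_UTF8) != 0;
    const bool in_ok = SetConsoleCP(CP_UTF8) != 0;
    return out_ok && in_ok;
#else
    return true;
#endif
}

// Reports the console's current output code page, or 0 when not on
// Windows (GetConsoleOutputCP also returns 0 when it fails).
unsigned console_output_code_page() {
#ifdef _WIN32
    return GetConsoleOutputCP();
#else
    return 0;
#endif
}
```

Calling console_output_code_page() before enable_utf8_console() shows what the
console started with; both calls can fail when no console is attached, for
example when output is redirected.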

*File names*

File names are stored in file systems and we can't tell everyone to please
use a filesystem that uses Unicode.
I think there are a few issues there:

   - create_file(u8"嘿") cannot work portably. This can be a runtime error,
   if we can detect the encoding of the filesystem, if any (which isn't
   actually always possible, but it can be faked well enough). I think one of
   the issues currently with paths is that there is no requirement that we
   feed valid UTF to these functions.
   - In an attempt to be very generic, there is no requirement that paths
   are valid text. The specification of the conversion functions probably
   needs work http://eel.is/c++draft/fs.class.path#fs.path.type.cvt.
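As a rough illustration of the "runtime error" option, here is a sketch using
std::filesystem. It assumes C++20, where a char8_t source is converted from
UTF-8 to the native path encoding; utf8_name and try_create_file are
hypothetical helpers, and create_file above is not a standard function
either. U+563F is the character used in the first bullet.

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Builds a path from a UTF-8 literal. Whether the filesystem can store
// this name is only known at run time.
fs::path utf8_name() {
    return fs::path(u8"\u563F.txt");
}

// Hypothetical helper: tries to create an empty file at `p` and reports
// failure as a return value instead of throwing.
bool try_create_file(const fs::path& p) {
    std::ofstream out(p);
    return out.is_open();
}
```

A caller can then treat a false return as "this name is not representable
here" (among other possible causes), which is about as much as a portable
program can conclude.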

It is always fun to think about how these things interact (can you open a
path whose name comes from a command line argument?). I wish we were a bit
more pragmatic about these things. That there exist paths with
embedded nulls, for example, doesn't mean that the standard should sacrifice
the common use case to make this work.

*Text file contents*

Strictly a contract between the applications that handle a given file.
Most files (most I/O, really) cannot be trusted, which implies the existence
of verification/decoding functions.
But this is not part of the environment.

*Function and file names encoded in the __FILE__ macro,
the __func__ predefined variable, and in std::source_location objects.*

These have no specified encoding, which... is probably a mistake. I don't
remember why, but there are probably minutes. It is easily addressed either
way. They should behave as ordinary string literals.

*Character and string literals.*
This applies to both ordinary and wide literals, actually

I saved the most interesting one for the end.
And by interesting, I mean contentious (apparently).

What C calls character functions have an expectation of text encoding,
and so passing _any_ string to these functions results in funny results.

I am very careful to not use the terms "precondition" and "ub" here as
we still don't have consensus that text errors are errors (*sigh*).

String literals are but a special case of strings that may be in a
different encoding from the one expected.
The one thing that is special about string literals is that they may be
interpreted at compile time, and it should not be observable (this is
mitigated by the fact that there is no standard constexpr text handling
function as of today).

Beyond that string literals don't really exist.
Any piece of data is a memcpy away from losing information about its
provenance.

const char* a = "こんにちは";
print(a); //1
std::string b{a};
print(b); //2

1 & 2 better produce the same output!

So, in practice, there is an assumption that string literals interpreted at
runtime conserve their semantics. From there passing a string literal to
any function at runtime that assumes a different encoding is no muy bueno.

This implies that whatever encoding is used by these functions is a
superset of the literal encoding.

Uncomfortably, we observe that this is not always the case in practice.
Would it be that all C++ programs are sometimes no muy bueno?

And I think your idea is an attempt at fixing that.

I think we should recognize the current situation first.

How do string literals and text-encoding-related/character/locale-specific
functions relate?
What happens when setlocale is called and changes the encoding?
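The locale dependence behind these two questions can be observed with a few
lines. This is a sketch only: current_locale_name and wide_length are
hypothetical helpers built on std::setlocale and std::mbsrtowcs, and which
locale names exist (for instance "en_US.UTF-8") depends on the platform.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Name of the current global C locale ("C" at program startup).
std::string current_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Number of wide characters the bytes of `s` decode to under the
// *current* locale, or static_cast<std::size_t>(-1) if the bytes are
// not valid in that locale's encoding.
std::size_t wide_length(const char* s) {
    std::mbstate_t state{};
    return std::mbsrtowcs(nullptr, &s, 0, &state);
}
```

For an ASCII literal the result is the same in every ASCII-based locale. For
a literal such as "こんにちは" compiled with a UTF-8 literal encoding, the
value returned by wide_length changes (or becomes an error) depending on
which locale setlocale last installed; the bytes of the literal never change.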

When we contend with these scenarios and come to the inevitable conclusion
that it cannot possibly work, we will be left with a few options:

   - Force users to transcode their string literals everywhere. I am hoping
   we will find that this solution is not very c++-y (performance wise), and a
   bit user hostile.
   - Make string literals that are not in the common subset of all
   encodings supported by the target platform ill-formed. For some platforms
   that is the empty set.
   - Improve UTF-8 support. I am not sure that last one belongs to the
   standard though. I mean, I would love for C++ to mandate
   args/env/stdin/stdout to be utf-8, but I expect I'd be banned from WG21.
   But it is a path that has been chosen by many programming languages, and
   not without good reasons.
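The second option can be approximated today in user code. A minimal sketch,
assuming (and this is only an assumption) that the common subset on the
targeted platforms is 7-bit ASCII, which is not true once EBCDIC is in the
mix; is_ascii is a hypothetical helper.

```cpp
#include <string_view>

// True if every byte of `s` is in the 7-bit ASCII range.
constexpr bool is_ascii(std::string_view s) {
    for (const char c : s)
        if (static_cast<unsigned char>(c) > 0x7F) return false;
    return true;
}

// A literal checked at compile time: the program is ill-formed if the
// literal contains anything outside the assumed common subset.
constexpr const char greeting[] = "hello";
static_assert(is_ascii(greeting), "literal is outside the common subset");
```

This only inspects the bytes produced by the literal encoding; it says nothing
about what the runtime character functions will make of them.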

And so we should

   - Try not to regress the current state of things
   - Try to improve it where possible - aka Windows.

As such I think trying to define "a UTF-8" environment doesn't give us much
as long as we have to also specify the behavior of other environments.

Especially as it is not clear from what you propose

   - What happens in a non-utf-8 environment?
   - How does a user opt-in to that utf-8 environment?
   - What happens if a program is designed to run in a utf8 environment and
   then doesn't?

And most of your observations relate to Windows. The question is then, what
is Microsoft willing to do?

   - Would Microsoft be willing to implement print as desired without the
   need for WG21 to write special wording for them?
   - Would Microsoft be willing to set the active code page to CP_UTF8
   under C++23 mode by default?
   - Would they be willing to provide a linker flag to do that? Will users
   understand that flag?

There are other platforms that have a mismatch between literal encodings and
what is used by character functions at runtime. What do we do there?
Are these implementers interested in improving the situation?

Are we willing to, for example

   - Deprecate calling setlocale with a different encoding?
   - Stop calling setlocale(LC_ALL, "C") before main?

For input (args, env, stdin, etc), the most reasonable solution might be to
decode them to UTF-8 (from LC_ALL associated encoding), instead of trying
to do things the other way around (But first we need something along the
lines of P1629).
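Until something like P1629 exists, the decoding direction can be sketched
with the C library. This assumes that wchar_t holds Unicode code points (true
where __STDC_ISO_10646__ is defined, e.g. glibc; not true on Windows, where
wchar_t is UTF-16). The names append_utf8 and locale_to_utf8 are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <optional>
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decodes `in` from the current locale's multibyte encoding and
// re-encodes it as UTF-8. Returns nullopt if the input is not valid in
// the locale's encoding.
std::optional<std::string> locale_to_utf8(const char* in) {
    std::mbstate_t state{};
    std::string out;
    std::size_t remaining = std::strlen(in);
    while (remaining > 0) {
        wchar_t wc = 0;
        const std::size_t n = std::mbrtowc(&wc, in, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2))
            return std::nullopt;  // invalid or truncated sequence
        if (n == 0) break;        // decoded a null character; stop
        append_utf8(out, static_cast<char32_t>(wc));
        in += n;
        remaining -= n;
    }
    return out;
}
```

A program would call setlocale(LC_ALL, "") once, run its arguments and
environment through locale_to_utf8, and work in UTF-8 from there on.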

There is a lot of teaching to be done too.
The standard should better specify these things though, that would go a
long way.

And as Victor demonstrated, output is one thing we do have control over, so
that's actually a non-issue on most systems. But on systems where it is an
issue, transcoding *from* Unicode is not an option so... tough luck?

In the end what I would recommend we do

   - Provide a way to decode/check inputs
   - Use Unicode output where available
   - Improve the specification of text functions to clearly state pre/post
   conditions
   - Deprecate most of <locale>
   - Work with vendors to increase utf8 adoption where possible

Corentin

On Wed, Jul 28, 2021 at 8:31 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> P2093R7 <https://wg21.link/p2093r7> and earlier revisions rely on the
> choice of UTF-8 for the literal encoding as a proxy indication that the
> program will run in a Unicode environment with the intent of basing the
> behavior of the proposed std::print() function on that choice.
>
> SG16 discussed this and other concerns at length during its May 12th
> <https://github.com/sg16-unicode/sg16-meetings#may-12th-2021>, May 26th
> <https://github.com/sg16-unicode/sg16-meetings#may-26th-2021>, June 9th
> <https://github.com/sg16-unicode/sg16-meetings#june-9th-2021>, and June
> 23rd <https://github.com/sg16-unicode/sg16-meetings#june-23rd-2021> 2021
> telecons. Consensus on this matter remains weak. At least some of the
> concerns raised about basing behavior on the choice of literal encoding
> include:
>
> 1. The choice of literal encoding has historically had no effect on
> the encodings used at run-time. For example, encoding sensitive functions
> like mbstowcs() and mbrtowc() do not alter their behavior based on the
> choice of literal encoding, nor is the encoding used for locale provided
> text based on it.
> 2. The proposed design does not treat all encodings equally; UTF-8 is
> treated differently than other commonly used encodings like Windows-1252.
> 3. The literal encoding may differ across translation units.
> 4. Given the following program fragment that contains only ASCII
> characters in string literals, its behavior would differ if the literal
> encoding is UTF-8 vs some other ASCII-based encoding regardless of whether
> that choice affects the data produced by get_some_text().
> std::print("{}", get_some_text());
>
> These concerns and the lack of good consensus has prompted me to look for
> alternative design possibilities that may lead to a solution with stronger
> consensus. This post explores one possibility.
>
> SG16 recently approved P2295R5 <https://wg21.link/p2295r5> and its
> proposed requirement for an implementation-defined mechanism to specify
> that source files are UTF-8 encoded. This approach reflects existing
> practice in Microsoft Visual C++ via its /source-charset:utf-8 option,
> GCC via its -finput-charset=utf-8 option, and Clang's default behavior.
> Perhaps we can likewise require an implementation-defined mechanism to
> specify that a program be run in a UTF-8 environment.
>
> What constitutes a UTF-8 environment for a C++ program? I think of an
> ideal UTF-8 environment as one where the following are all (ostensibly)
> UTF-8 encoded:
>
> 1. Ordinary character and string literals.
> 2. Function and file names encoded in the __FILE__ macro, the __func__
> predefined variable, and in std::source_location objects.
> 3. Command line arguments.
> 4. Environment variable names.
> 5. Environment variable values.
> 6. Locale supplied text.
> 7. The default devices associated with stdin, stdout, and stderr
> (e.g., the terminal/console encoding assuming no redirection of the
> streams).
> 8. File names.
> 9. Text file contents.
>
> In practice, no implementation is in a position to guarantee well-formed
> UTF-8 for all of the above. That suggests that there isn't a single notion
> of a portable UTF-8 environment, but rather a spectrum. For example, file
> names may typically be UTF-8 encoded, but not enforced; different text
> files may be differently encoded; environment variables may hold binary
> data. That is all ok; the goal is to establish expectations, not obviate
> the need for error handling or special cases.
>
> If the standard were to define a UTF-8 environment, then each of the above
> could be considered conformance rules for which an implementation could
> document their conformance; similarly to what we recently did for P1949
> <https://wg21.link/p1949> and conformance with UAX #31
> <http://www.unicode.org/reports/tr31>.
>
> Taking this back to P2093 <https://wg21.link/p2093>. With a
> specification for a UTF-8 environment and an implementation-defined
> mechanism to opt-in to it, the special behavior we've been discussing for
> std::print() could be tied to it instead of to the choice of literal
> encoding.
>
> However, I think we can do better.
>
> Corentin created the github repo at
> https://github.com/cor3ntin/utf8-windows-demo to demonstrate how to build
> a program that uses UTF-8 at run-time and that can successfully write UTF-8
> encoded text to the Windows console without having to use the stream bypass
> technique documented in P2093R7 <https://wg21.link/p2093r7>. Instead of
> bypassing the stream, it explicitly sets the encoding of the console and
> uses an application manifest to run the program with the Active Code Page
> (ACP) set to UTF-8. The latter has the effect that command line options,
> environment variable names and values, locale supplied text, and file names
> will all be provided in UTF-8. Combined with the Visual C++
> /execution-charset:utf-8 option, a program built in this way will run in
> an environment that closely matches the UTF-8 environment I described above.
>
> It turns out that the ability to build a C++ program that runs in
> something like a UTF-8 environment already matches existing practice for
> common platforms:
>
> - On Windows:
> - As Corentin's work demonstrates, programs on Windows can force
> the ACP to UTF-8 by linking with an appropriate manifest file; this opts a
> program into using UTF-8 for command line options, environment variables,
> locale supplied text, and file names.
> - The console/terminal encoding can be set to UTF-8 by calling
> SetConsoleCP() and SetConsoleOutputCP().
> - The literal encoding can be set to UTF-8 by compiling with Visual
> C++ and the /execution-charset:utf-8 option.
> - On Linux/UNIX:
> - Running in a UTF-8 environment is already standard practice.
> - On z/OS:
> - IBM supports targeting an "enhanced ASCII" run-time environment
> that implicitly converts between ASCII and EBCDIC. Though ASCII is the
> only encoding supported at present, this feature could potentially provide
> a basis for supporting a UTF-8 environment in the future.
>
> The existing opt-in mechanisms are less than ideal; particularly the need
> for explicit function calls on Windows to set the console encoding. It may
> be that implementors would be willing to make improvements.
>
> There are a number of details that would need to be worked out. Some
> examples:
>
> - On POSIX systems, what would it mean to run a program built to
> target a UTF-8 environment in an environment with LC_ALL set, e.g.,
> zh_HK.big5hkscs? Should that be UB? Should the .big5hkscs property
> be ignored? Should we specify that the implementation implicitly transcode?
> - On POSIX systems, localedef can be used to define a locale with its
> own character set and character classifications. Can implementations
> reasonably reason about the encoding of such locales?
>
> Comments and questions would be appreciated. Is this a direction worth
> pursuing?
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-07-28 05:11:16