
Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Wed, 28 Jul 2021 23:09:32 +0000
> These things have different natures on different platforms.
> Bytes on posix, UTF-16 on Windows (or WTF-16, not sure)

WTF-16 is not an encoding, and is not named in the WTF-8 spec document. The Windows command line (there's only one argument!) and environment variables are sequences of shorts, theoretically in platform byte order, although I think Windows has only ever really supported little-endian machines. The WTF-8 document calls this "potentially ill-formed UTF-16" if it's intended to be interpreted as UTF-16 text, and applications in standard C++ will _sometimes_ interpret the parameters and variables as UTF-16 and sometimes not. In some cases they may be interpreted as UCS-2, and sometimes as a sequence of narrow characters in some codepage zero-extended to 16 bits. I have no idea if you can get into a situation where the shell/command processor will give the program a sequence of zero-extended UTF-8 code units.

I think the UCS-2/UTF-16 distinction is controlled by _UNICODE, at least for the transcoding into arguments.

Windows does not distinguish between multiple command line arguments: programs just get one big block of text from the kernel, not separate arguments. The CRT splits this block into multiple arguments before calling main() (or wmain()). This is a property of the NT kernel, not just Windows (although the "subsystem" could always do some kind of quoting / splitting).

Because standard C++ programs always use main(), the parameters are _always_ interpreted as some kind of text: the CRT will go through and split the command line into separate arguments depending on control/quoting characters such as '\', '"', and '''.

It's not totally clear if the quotes are interpreted in the active codepage or as invariant (always 0x22). Backslash is _always_ invariant in all Windows codepages because it's the path separator. Someone should test this.
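
If someone wants to poke at this, here's a tiny Windows-only sketch (untested here, and just one way to do it) showing the single unsplit block the kernel hands the process versus the argv-style split the shell32 helper (or the CRT before main) performs:

    #include <windows.h>
    #include <shellapi.h>   // CommandLineToArgvW; link with Shell32.lib
    #include <cstdio>

    int main() {
        // The process really receives one unsplit block of "text".
        wchar_t* block = GetCommandLineW();
        std::wprintf(L"raw command line: %ls\n", block);

        // CommandLineToArgvW applies the quoting/backslash rules to
        // produce argv-style pieces, much like the CRT does before main().
        int argc = 0;
        wchar_t** argv = CommandLineToArgvW(block, &argc);
        for (int i = 0; i < argc; ++i)
            std::wprintf(L"argv[%d]: %ls\n", i, argv[i]);
        LocalFree(argv);    // the returned argv array is one allocation
    }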


> int main(int argc, char8_t** args, char8_t** env)

Yeah, I think anything like this should be specified to be WTF-8; even on posix, making them actual UTF-8 would break file path arguments. With WTF-8 you can round-trip to the original sequence of potentially ill-formed UTF-16 code units.
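
To make that concrete, here's a toy sketch (not a real WTF-8 implementation; it only shows the lone-surrogate case, and real WTF-8 still encodes a valid surrogate pair as one four-byte sequence) of why the round trip works: a lone surrogate just gets the usual three-byte pattern instead of being rejected:

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Encode one 16-bit code unit with the "generalized UTF-8" byte
    // pattern; unlike strict UTF-8 this accepts surrogates (D800..DFFF).
    std::vector<std::uint8_t> encode_unit(std::uint16_t u) {
        if (u < 0x80)  return { std::uint8_t(u) };
        if (u < 0x800) return { std::uint8_t(0xC0 | (u >> 6)),
                                std::uint8_t(0x80 | (u & 0x3F)) };
        return { std::uint8_t(0xE0 | (u >> 12)),
                 std::uint8_t(0x80 | ((u >> 6) & 0x3F)),
                 std::uint8_t(0x80 | (u & 0x3F)) };
    }

    std::uint16_t decode_unit(const std::vector<std::uint8_t>& b) {
        if (b.size() == 1) return b[0];
        if (b.size() == 2) return std::uint16_t(((b[0] & 0x1F) << 6) | (b[1] & 0x3F));
        return std::uint16_t(((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F));
    }

    int main() {
        std::uint16_t lone = 0xD800;          // ill-formed as UTF-16
        auto bytes = encode_unit(lone);       // ED A0 80
        assert(decode_unit(bytes) == lone);   // round-trips exactly
    }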

I'm not 100% sure exactly how the CRT transcodes things for parameters when you manifest for UTF-8. It _probably_ calls WideCharToMultiByte, which doesn't result in WTF-8; it will error or emit replacements if I remember correctly. It may be difficult to change this behavior for backward compatibility reasons.
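
For example (Windows-only, and I haven't re-verified the exact behavior, so treat this as a sketch of what I think happens): with WC_ERR_INVALID_CHARS the call fails on an unpaired surrogate, and without it the surrogate becomes U+FFFD, so neither mode gives you bytes that round-trip back to the original wide string.

    #include <windows.h>
    #include <cstdio>

    int main() {
        // "a", lone high surrogate, "z", NUL -- ill-formed as UTF-16.
        const wchar_t ill_formed[] = { L'a', wchar_t(0xD800), L'z', L'\0' };

        char out[16];
        // Strict: fails (ERROR_NO_UNICODE_TRANSLATION) on the surrogate.
        int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                    ill_formed, -1, out, sizeof out,
                                    nullptr, nullptr);
        if (n == 0)
            std::printf("strict conversion failed: %lu\n", GetLastError());

        // Lenient: succeeds, but the surrogate comes out as U+FFFD,
        // so the original wide string can no longer be recovered.
        n = WideCharToMultiByte(CP_UTF8, 0, ill_formed, -1,
                                out, sizeof out, nullptr, nullptr);
        if (n != 0)
            std::printf("lenient conversion produced %d bytes\n", n);
    }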

An alternative is for the standard to specify a signature for main() that always takes parameters in the platform "native" manner (kinda like _tmain). I don't know if this should include having argc always equal 2 with all arguments as one block in argv[1] on windows.


> But, it is true that on windows calling SetConsole{Output}CP(CP_UTF8) would solve the windows problem

It's really hard for the standard to depend on this happening. We can't have standard functions demand it, because that would mean they'd either fail if it's unset (the default, and unlikely to change due to compatibility) or they'd have to take a process-global lock to set the console codepage (not good!). We'd probably be pretty strongly opposed to standard features that would require us to say "/std:c++26 makes UTF-8 the default, set by the CRT on startup", because that adds a ton of friction to the upgrade process, especially for folks who want to use the feature from a DLL built in such a C++26 mode (it would be pretty rude of a DLL to change your console encoding when it loaded, wouldn't it!)

> create_file(u8"嘿") cannot work portably. This can be a runtime error, if we can detect the encoding of the filesystem, if any (which isn't actually always possible, but it can be faked well enough). I think one of the issue currently with paths is that there is no requirements that we feed valid utf to these functions

This absolutely can work portably. For filesystems that store filenames as sequences of 16-bit shorts, they would just widen to UTF-16 and use that. For filesystems that store stuff as sequences of bytes, they can just write the bytes out. It's a mistake for create_file to ever try to transcode to something that doesn't have a mapping from every Unicode codepoint. It can even be portable for filesystems that store filenames in a way that's not 8-bit clean (let's say as a sequence of 6-bit bytes). In that case they could do something like using NUL (or perhaps the path separator) to start a shift encoding for things that don't fit in their smaller bytes.

The thing that's not portable is using create_file(u8"嘿") and expecting it to open an existing file that was created using some other mapping to the actual byte sequence stored. All this means is that to really be portable you need to provide at least one "create_file" overload that takes the actual native path type as determined by the kernel (since the kernel itself had better do all the conversions from its internal representation to the filesystem's representation in the same way all the time).
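
A sketch of what I mean (create_file here is just a made-up free function, not a proposal for a real signature, and this assumes C++20): offer the convenient UTF-8 overload, but also one taking the platform's native code unit type, so that any path the kernel can actually hand you is representable.

    #include <filesystem>
    #include <fstream>
    #include <string_view>

    #ifdef _WIN32
    using native_char = wchar_t;   // NT paths are sequences of 16-bit units
    #else
    using native_char = char;      // POSIX paths are sequences of bytes
    #endif

    // Lossless overload: whatever the kernel gave you can be passed back.
    bool create_file(std::basic_string_view<native_char> name) {
        std::ofstream out{std::filesystem::path{name}};
        return static_cast<bool>(out);
    }

    // Convenience overload: UTF-8 in, converted to the native representation
    // by std::filesystem::path (this conversion is where ill-formed input
    // would get mangled, so it can't be the *only* overload).
    bool create_file(std::u8string_view name) {
        std::ofstream out{std::filesystem::path{name}};
        return static_cast<bool>(out);
    }

    int main() {
        return create_file(u8"嘿.txt") ? 0 : 1;   // portable for new files
    }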

> Function and file names encoded in the __FILE__ macro, the __func__ predefined variable, and in std::source_location objects.

Can't specify this for __FILE__: while the part of the filename you write in an include directive needs to have a lossless conversion to the execution character set, the full path can be in any encoding, and actually it can be in _multiple_ different encodings, as long as the path separator is invariant in all of them.

Not sure about __func__.

> What C calls character functions have an expectation of text encoding

Most have an expectation of just specific properties of the encoding, for example that the thing encoded as 0x00 is, in fact, a terminator. The expectations of each are different. It's fine if some STL functions require their parameters to be encoded as UTF-8. Some way for users to assert that their normal "string" is actually UTF-8 would probably be required. It would be nice if u8string/char8_t and friends were appropriate for this task, but it's really inconvenient for different encodings to use different, essentially unrelated types.
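
Just to illustrate the kind of thing I mean (the names here are invented, this isn't an existing or proposed API): even a dumb tag type would let a caller state the precondition explicitly instead of the library guessing.

    #include <cstddef>
    #include <string_view>

    // Caller promises the bytes are well-formed UTF-8; the library can then
    // state that as a precondition rather than consulting the locale.
    struct asserted_utf8 {
        std::string_view bytes;
    };

    // Hypothetical library function that wants UTF-8 input.
    std::size_t count_code_units(asserted_utf8 s) { return s.bytes.size(); }

    int main() {
        asserted_utf8 s{ "hello" };   // ASCII, so valid UTF-8 under any
                                      // ASCII-based literal encoding
        return count_code_units(s) == 5 ? 0 : 1;
    }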

> Would Microsoft be willing to implement print as desired without the need for WG21 to write special wording for them?

It depends on whether it's implementable. If the committee adopts something like the proposed "transcode to UTF-8 but don't set the console CP" then we can do that; requiring us to set the console CP in std::print would be a problem for the reasons mentioned above.

> Would Microsoft be willing to set the active code page to CP_UTF8 under C++23 mode by default?

Probably not. For the reasons mentioned above (it's extremely unfriendly for DLLs, and will cause substantial friction for users upgrading to a new c++ standard version). For anyone distributing DLLs, either libraries or plugins, requiring the UTF-8 codepage for C++23 mode would essentially be telling them they could not upgrade until all their users upgraded/changed their character set. This is the case for both the active code page and the console code page (which are not the same).

> Would they be willing to provide a linker flag to do that? Will users understand that flag?

Adding a linker option that sets the manifest option to turn on CP_UTF8 as the active code page is a good idea (something similar to the existing #pragma comment(linker, "/MANIFESTDEPENDENCY")), but there are issues around how it propagates and what happens when building a DLL. Anyway, the standard can't really depend on such an option.

> There are other platforms that have mismatch between literal encodings and what is used by character functions at runtime. What do we do there?
> Are these implementers interested in improving the situation?

This is basically why I'm uncomfortable with adding more and more stuff that depends on the literal encoding, especially when dealing with things that are not literals. There are a lot of strings from the operating system that are just sequences of bytes: usually strings, but really just binary data. When dealing with them you just need to use robust decoders and rely on the user, and having the library transcode all over the place can make it much, much harder to write correct programs in the presence of such data. In particular I would, in general, like to be able to take a filename as part of a named parameter (like "--file=...") and have that parameter be able to represent any possible file on the filesystem, and have my program be able to refer to it (open it, read from it, etc.). The usual suspect is files on Windows that have unpaired surrogates in their names. If you do transcoding it's very easy to make these files unopenable.

As far as the kernel is concerned, Windows filenames can actually contain embedded NUL characters as well; no filesystems that come with Windows allow this, but a third-party filesystem might. Same with embedded forward slashes in filenames. One could write a filesystem driver / Dokany driver that simply conjures up as many cursed filenames as possible.

> Provide way to decode/check inputs
- Yes, but only Unicode encoding forms; making implementations provide an entire non-Unicode transcoding and detection library is a lot.
- Probably no need to include encoding or encoding scheme _detection_ (via heuristics), for example detecting if a string is UTF-16 little endian or big endian via frequency heuristics
- it would be nice to have a standard way to transcode between UTF-8/16/32 in a way that isn't broken by design like codecvt is (a rough sketch of the shape such an interface might take follows this list). It would be nice if such a mechanism also included transcoding between WTF-8 and potentially ill-formed UTF-16 (not sure what this should do when going to UTF-32). Such a mechanism should also include options to select whether invalid code unit sequences produce replacement characters or errors (not in the WTF case, ofc).
- providing functions for transcoding to/from other Unicode encodings (UTF-1, UTF-7, UTF-EBCDIC, GB18030, BOCU-1, CESU-8, SCSU, etc) is probably not necessary for the standard, but the mechanism should probably be able to support them by basically just adding more functions.
- different schemes of the same form probably don't need to be supported, just using the "system" one is probably fine
- it's natural to want converting iterators.
- I think this is the goal of ztd.text, although I've not talked to JeanHeyd Meneide about his plans for that library.
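
Something like the following is the shape I have in mind (names invented for illustration; the body only handles ASCII and treats everything else as invalid, purely to keep the sketch short -- a real version would decode full UTF-8, and ideally the WTF forms too):

    #include <optional>
    #include <string>
    #include <string_view>

    enum class error_policy { replace, fail };

    // UTF-8 -> UTF-16; returns nullopt on invalid input under `fail`,
    // otherwise substitutes U+FFFD for each invalid code unit.
    std::optional<std::u16string> to_utf16(std::u8string_view in,
                                           error_policy policy) {
        std::u16string out;
        for (char8_t c : in) {
            if (c < 0x80) {
                out.push_back(static_cast<char16_t>(c));
            } else if (policy == error_policy::replace) {
                out.push_back(u'\uFFFD');
            } else {
                return std::nullopt;
            }
        }
        return out;
    }

    int main() {
        auto ok  = to_utf16(u8"hello", error_policy::fail);   // has a value
        auto bad = to_utf16(u8"héllo", error_policy::fail);   // nullopt
        return (ok && !bad) ? 0 : 1;
    }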

> Use Unicode output where available
While this is nice, it might not be a great idea when we need to munge / do a lot of "stuff" to get the Unicode output to work. Maybe it's better to just provide output interfaces that pass data through to the kernel without any modifications.

> Improve the specification of text functions to clearly state pre/post conditions
Yes.
printf format string parsing in most libraries probably depends on "%" being invariant in all supported character sets (it's even invariant between ASCII and EBCDIC). I also don't think "%" (0x25) appears as a trailing byte in any common multi-byte encoding, unlike "{" and "}"; not 100% sure there.

If we're going to have some global setting for what the encoding of strings and string_views is supposed to be we need something better than locale. Maybe ztd.text's basic_text strategy is right here, maybe not.

> Deprecate most of <locale>
I mean, yes. It's not that bad to have "locales specified by string that may or may not contain various properties", but the troubles with encodings, and with the properties that can't represent multi-byte characters, are pretty annoying. Actually, for std::format I would not have minded if locale support was omitted entirely from the standard format specifiers, relying instead on user-defined formatters. It does bother me a little that we keep adding new functionality that depends on <locale>, even knowing we'll (hopefully) be replacing it someday, especially because the replacement probably won't have exactly the same set of locales, and will probably have different values for some locale-related data (in particular where the current locale facet can't deal with multi-byte characters).

> Work with vendors to increase utf8 adoption where possible

Yes, although the real problem is Unicode adoption; GB18030 and UTF-16 don't really cause that many problems (although admittedly GB18030 is a much more annoying form for many algorithms than UTF-8, and maybe some standard library text functions will require that the string is in a self-synchronizing Unicode encoding form).


A final thought:

Parameterizing literally every program on some platform string type that's 8-bit on Unix and 16-bit on Windows _can_ actually work, given the correct API and transcoding facilities.
