Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Thu, 29 Jul 2021 22:32:56 +0000
> Would that be strictly better than the status quo?

Upon further reflection (aka: sleep) I think it would be at most as good as the status quo, and probably worse (discounting that if the encoding isn't actually utf-8 we probably shouldn't use char8_t). It would be worse since it would mandate the broken behavior (lossy transcoding) that our implementation currently does in all cases.

> WTF-8 cannot be assumed to be valid and so some checking has to be done by the users, at which point the difference compared to giving them a char* and having them decode it is negligible.

Yes, and in any case, if we wanted to ensure the parameters were actually UTF-8, the runtime startup code would have to do that check. If users do the checking themselves, they can defer or omit validity checks in some cases. This can be important: to check that the string is actually well formed you need to _actually look_ at every single byte and then execute a sequence of probably a few dozen instructions to decide if it's valid. Sometimes it's OK to just assume it _is_ valid if you don't do anything that actually requires the whole thing to be valid. For example, you may linearly search for delimiters and then parse the text between them. As long as you are careful about validating the text between the delimiters, it doesn't matter if some other part of the string is bogus, and you never have to execute the instructions that would check those other bits of the string.
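
A minimal C++ sketch of that deferred-validation pattern, assuming an ASCII delimiter and a byte string that may be ill-formed elsewhere. The names `is_valid_utf8` and `nth_field` are illustrative, not from any existing library; the validator follows the well-formed byte ranges from the Unicode Standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true if `s` is well-formed UTF-8: no overlong forms, no surrogates,
// nothing above U+10FFFF. This is the "look at every byte" cost.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }        // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;        // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;             // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;             // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;             // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;             // reject values above U+10FFFF
        } else {
            return false;                          // stray continuation or invalid lead byte
        }
        if (s.size() - i < len) return false;      // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) return false;
            lo = 0x80; hi = 0xBF;                  // later bytes use the full range
        }
        i += len;
    }
    return true;
}

// Returns the `index`-th field of `s` split on the ASCII byte `delim`,
// or an empty view if there is no such field. No validation happens here.
std::string_view nth_field(std::string_view s, char delim, std::size_t index) {
    for (;;) {
        const std::size_t pos = s.find(delim);
        if (index == 0) return s.substr(0, pos);
        if (pos == std::string_view::npos) return {};
        s.remove_prefix(pos + 1);
        --index;
    }
}
```

With an input such as `"\xFF\xFE junk;name=h\xC3\xA9;tail"`, `nth_field(input, ';', 1)` yields a field that passes `is_valid_utf8` even though the whole string does not, and the bogus leading bytes are never run through the validator.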

It is useful to have a function that converts potentially ill-formed UTF-16 to WTF-8 and back on Windows, where UTF-16 encoding is the convention.
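
As a rough sketch of what such a pair of functions could look like (the names `utf16_to_wtf8` and `wtf8_to_utf16` are hypothetical, and the decoder assumes its input was produced by the encoder, so it does no validation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the generalized UTF-8 encoding of `cp`; surrogate values are allowed.
void append_generalized_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Potentially ill-formed UTF-16 -> WTF-8. Proper surrogate pairs become one
// 4-byte sequence; unpaired surrogates are kept as 3-byte sequences.
std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_generalized_utf8(out, u);
    }
    return out;
}

// WTF-8 -> potentially ill-formed UTF-16. Assumes well-formed WTF-8 input;
// decoding stops early if the last sequence is truncated.
std::u16string wtf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        if (in.size() - i < len) break;
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);  // includes unpaired surrogates
        }
    }
    return out;
}
```

The property that matters here is the round trip: `wtf8_to_utf16(utf16_to_wtf8(s)) == s` for any sequence of 16-bit code units, including ones with unpaired surrogates.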

On Linux, WTF-8 would actually be "reversible" if you establish a convention for aligning the arguments to 2 bytes and then always convert the char* array from the kernel from potentially ill-formed UTF-16 to WTF-8 (cast to uint16_t*, then convert). This would cause character values to differ from the custom, but it is no less valid than any other conversion, and it does round trip. Doing that would not be useful, though, because the custom is that if you want to give a Linux program the parameter "A" you encode it as "0x4100", not as "0x0041'0000" (all big endian notation).

> Does it follow from the observation that it is technically possible under some scenario to form paths with embedded nulls that it's something that should be supported here?
It is possible to form paths with embedded nulls, but they aren't that useful. On Windows it's _absolutely_ possible for the program argument string to contain embedded nulls, but modifying main to support those probably isn't a good use of time. (You can kinda pass such parameters to Linux programs by escaping a null with another null, resulting in a zero-length argument.)

> More seriously, it would be great to have that as an opt-in (independent of WG21)
It's already an opt-in, the switch to opt in is to call SetConsoleCP(CP_UTF8) at the start of your program. Maybe a manifest option would be a good idea, but again, it's not clear how that should work for DLLs.

> I am really not concerned about things that are paths or binary blobs here.
I agree it's not a big deal for format's locale specifiers.

> The issue I think we should try to resolve is what you called "mixed" encoding, when literals and runtime-encoding strings have different encodings.

I think "mixed" encoding should mean "byte strings (std::string, std::string_view, char*, etc.) where different subsequences are in different encodings". This is different from the encoding of literals perhaps not matching the encoding of strings returned from various runtime functions or the OS.

yes, this is important in the context of file access functions and the entry point. I know I used the example of unpaired surrogates, which is a very pathological case. I like that example because it's easy to hold in your head and reason about, and if that works it's likely the more important cases work too.

More important are folders with invalid UTF-16 in their names (made worse on Windows by how SHGetKnownFolder works and by the lack of anything documented like openat). Even there, std::filesystem is willing to die upon encountering such paths, and apparently that doesn't cause too many problems, even with home folder names.

> sometimes the arguments will be only ascii... but in the end some users will have non ascii characters in both the format strings and the arguments and this should be possible.
> And the only possible way that can work in practice is if implementers make it easy for them to opt-in to UTF-8.

The only actual requirement for characters in the format string is on the representation of control characters and character boundaries, which need to match the execution character set. If the execution character set is self-synchronizing then you can have arbitrary sequences of bytes, even nulls, in the format string and everything will work out just fine, as long as none of those byte sequences form a subsequence that is a valid control character. UTF-8 is very much not required, and programs that use universal character names (or the literal character, if supported by the implementation) will work just fine in _any_ encoding that is actually a Unicode encoding form.

Further, if the user has non-ASCII characters in the format string and the arguments, and the encodings don't match, I think it's reasonable and desirable for the result to be a string with subsequences in different encodings. Using "format("{}{}{}", a, b, c)" as a shorthand for a + b + c is reasonable (and notably no characters from the format string end up in the output), as is using "format("{}/{}.{}", base, name, extension)" to form paths.

Saying that the only way this stuff can work in practice is if you opt into UTF-8 is just incorrect. Both examples work totally fine under _any_ character set. The model of format is string concatenation with some options, and it's totally valid to concatenate strings in different encodings everywhere that uses byte strings. For languages that assert as a precondition that strings are valid utf-8 (maybe in c++ with char8_t strings) they don't worry about it when concatenating, and don't support concatenating byte strings with utf-8-by-construction strings.


From: Corentin Jabot <corentinjabot_at_[hidden]>
Sent: Thursday, July 29, 2021 2:53 AM
To: Charlie Barto <Charles.Barto_at_microsoft.com>
Cc: sg16_at_[hidden]; Tom Honermann <tom_at_honermann.net>
Subject: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding



On Thu, Jul 29, 2021 at 1:09 AM Charlie Barto <Charles.Barto_at_microsoft.com> wrote:
> These things have different natures on different platforms.
> Bytes on POSIX, UTF-16 on Windows (or WTF-16, not sure)

WTF-16 is not an encoding, and is not named in the WTF-8 spec document. Windows command line argument (only one!) and environment variables are sequences of shorts, theoretically in platform byte order, although I think windows has only ever really supported little endian machines. The WTF-8 document calls this "potentially ill-formed UTF-16" if it's intended to be interpreted as UTF-16 text, and applications in standard C++ will _sometimes_ interpret the parameters and variables as UTF-16 and sometimes not. In some cases it may be interpreted as UCS-2 and sometimes as a sequence of narrow characters in some codepage zero extended to 16-bits. I have no idea if you can get in a situation where the shell/command processor will give the program a sequence of zero extended UTF-8 code units.

I think the UCS-2/UTF-16 distinction is controlled by _UNICODE, at least for the transcoding into arguments.

Windows does not distinguish between multiple command line arguments, programs just get one big block of text, not multiple arguments from the kernel. The CRT splits this block into multiple arguments before calling main() (or wmain()). This is a property of the NT kernel, not just windows (although the "subsystem" could always do some kind of quoting / splitting).

Because standard C++ programs always use main(), the parameters are _always_ interpreted as some kind of text, because the CRT will go through and split the command line into separate arguments depending on control/quoting characters such as '\', '"', and '''.
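
To illustrate the "one block of text, split by the runtime" model, here is a deliberately simplified C++ sketch. It is not the CRT's actual algorithm: it only treats spaces and tabs as separators and double quotes as grouping, and it ignores backslash escaping entirely. The function name `split_command_line` is made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a single command-line block into arguments.
// Simplified rules (an assumption of this sketch, not the real CRT rules):
//   - unquoted spaces and tabs separate arguments
//   - a double quote toggles "quoted" mode and is not copied to the output
//   - backslashes are treated as ordinary characters
std::vector<std::wstring> split_command_line(const std::wstring& cmd) {
    std::vector<std::wstring> args;
    std::wstring current;
    bool in_quotes = false;
    bool have_arg = false;  // distinguishes "" (empty argument) from no argument
    for (wchar_t c : cmd) {
        if (c == L'"') {
            in_quotes = !in_quotes;
            have_arg = true;
        } else if (!in_quotes && (c == L' ' || c == L'\t')) {
            if (have_arg) {
                args.push_back(current);
                current.clear();
                have_arg = false;
            }
        } else {
            current += c;
            have_arg = true;
        }
    }
    if (have_arg) args.push_back(current);
    return args;
}
```

Even this toy version has to decide which code unit values count as a quote or a separator, which is exactly where the question of how those characters are interpreted comes in.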

It's not totally clear if the quotes are interpreted in the active codepage or as invariant (always 0x22). Backslash is _always_ invariant in all Windows codepages because it's the path separator. Someone should test this.

Thanks, this was informative
 


> int main(int argc, char8_t** args, char8_t** env)

Yeah I think anything like this should be specified to be WTF-8, even on posix making them actual utf-8 would break file path arguments. With WTF-8 you can round trip to the original sequence of potentially ill formed utf-16 code units.

Would that be strictly better than the status quo?
WTF-8 cannot be assumed to be valid and so some checking has to be done by the users, at which point the difference compared to giving them a char* and having them decode it is negligible.
But there is another question here.
Does it follow from the observation that it is technically possible under some scenario to form paths with embedded nulls that it's something that should be supported here?
I am not saying we should deprive Windows users of this capability but if we care, they could keep using the existing entry points.
Same is true in other cases. Native APIs can keep handling these things, it's not obvious to me that the standard should!


I'm not 100% sure exactly how the CRT transcodes things for parameters when you manifest for UTF-8. It _probably_ calls WideCharToMultiByte, which doesn't result in WTF-8; it will error or emit replacements, if I remember correctly. It may be difficult to change this behavior for backward compatibility reasons.

An alternative is for the standard to specify a signature for main() that always takes parameters in the platform "native" manner (kinda like _tmain). I don't know if this should include having argc always equal 2 with all arguments as one block in argv[1] on windows.


> But, it is true that on windows calling SetConsole{Output}CP(CP_UTF8) would solve the windows problem

It's really hard for the standard to depend on this happening, we can't have standard functions demand it, because that would mean they'd either fail if it's unset (the default, and unlikely to change due to compatibility) or they'd have to take a process global lock to set the console codepage (not good!). We'd probably be pretty strongly opposed to standard features that would require us to say "/std:c++26 makes the default utf-8, set by the CRT on startup" because that adds a ton of friction to the upgrade process, especially for folks who want to use the feature from a dll built in such a c++26 mode (it would be pretty rude of a dll to change your console encoding when it loaded wouldn't it!)

And this is why we can't have nice things!
More seriously, it would be great to have that as an opt-in (independent of WG21)
 

> create_file(u8"嘿") cannot work portably. This can be a runtime error, if we can detect the encoding of the filesystem, if any (which isn't actually always possible, but it can be faked well enough). I think one of the issues currently with paths is that there is no requirement that we feed valid UTF to these functions

This absolutely can work portably. For filesystems that store filenames in sequences of 16-bit shorts they would just widen to UTF-16 and use that. For filesystems that store stuff in sequences of bytes they can just write the bytes out. It's a mistake for create_file to ever try and transcode to something that doesn't have a mapping from every Unicode codepoint. It can even be portable for filesystems that store filenames in a way that's not 8-bit clean, (let's say as a sequence of 6-bit bytes). In that case they could do something like using NUL (or perhaps the path separator) to start a shift encoding for things that don't fit in their smaller bytes.

The thing that's not portable is using create_file(u8"嘿") and expecting it to open an existing file that was created using some other mapping to the actual byte sequence stored. All this means is that to really be portable you need to provide at least one "create_file" overload that takes the actual native path type as determined by the kernel (since the kernel itself had better do all the conversions from its internal representation to the filesystem's representation in the same way all the time).

> Function and file names encoded in the __FILE__ macro, the __func__ predefined variable, and in std::source_location objects.

Can't specify for __FILE__. While the part of the filename you write in an include directive needs to have a lossless conversion to the execution character set, the full path can be in any encoding, and actually it can be in _multiple_ different encodings, as long as the path separator is invariant in all of them.

not sure about __func__

> What C calls character functions have an expectation of text encoding

Most have an expectation of just specific properties of the encoding. For example that the thing encoded as "0x0" is, in fact, a terminator. The expectations of each are different. It's fine if some STL functions require their parameters to be encoded as UTF-8. Some way for users to assert that their normal "string" is actually utf-8 would probably be required. It would be nice if u8string/char8_t and friends were appropriate for this task, but it's really inconvenient for different encodings to use different, essentially unrelated types.

> Would Microsoft be willing to implement print as desired without the need for WG21 to write special wording for them?

It depends on whether it's implementable. If the committee adopts something like the proposed "transcode to UTF-8 but don't set the console CP" then we can do that; requiring us to set the console CP in std::print would be a problem for the reasons mentioned above.

> Would Microsoft be willing to set the active code page to CP_UTF8 under C++23 mode by default?

Probably not. For the reasons mentioned above (it's extremely unfriendly for DLLs, and will cause substantial friction for users upgrading to a new c++ standard version). For anyone distributing DLLs, either libraries or plugins, requiring the UTF-8 codepage for C++23 mode would essentially be telling them they could not upgrade until all their users upgraded/changed their character set. This is the case for both the active code page and the console code page (which are not the same).

> Would they be willing to provide a linker flag to do that? Will users understand that flag?

Adding a linker option that sets the manifest option to turn on CP_UTF8 as the active code page is a good idea (something similar to the existing #pragma comment(linker, "/MANIFESTDEPENDENCY") ), there are issues around how it propagates and what happens when building a DLL, however. Anyway, the standard can't really depend on such an option.

> There are other platforms that have mismatch between literal encodings and what is used by character functions at runtime. What do we do there?
Are these implementers interested in improving the situation?

This is basically why I'm uncomfortable with adding more and more stuff that depends on the literal encoding, especially when dealing with things that are not literals. There are a lot of strings from the operating system that are just sequences of bytes that are usually strings, but really just binary data. When dealing with them you just need to use robust decoders and rely on the user, and having the library transcode all over the place can make it much, much harder to write correct programs in the presence of such data. In particular I would, in general, like to be able to take a filename as part of a named parameter (like "--file=...") and have that parameter be able to represent any possible file on the filesystem and have my program be able to refer to it (open it, read from it, etc). The usual suspect is files on Windows that have unpaired surrogates in their names. If you do transcoding it's very easy to make these files unopenable.

I am really not concerned about things that are paths or binary blobs here.
The issue I think we should try to resolve is what you called "mixed" encoding, when literals and runtime-encoding strings have different encodings.
And like, it doesn't really matter in the context of the conversation we had yesterday what users are likely to do with format. Sometimes the format string will only contain ascii and the arguments not,
sometimes the arguments will be only ascii... but in the end some users will have non ascii characters in both the format strings and the arguments and this should be possible.
And the only possible way that can work in practice is if implementers make it easy for them to opt-in to UTF-8.
Then we can teach them what is text and what is binary blobs.
 

Windows filenames can actually contain embedded NUL characters as well as far as the kernel is concerned, no filesystems that come with windows allow this, but a third-party filesystem might. Same with embedded forward slashes in filenames. One could write a filesystem driver/dokany driver that simply conjures up as many cursed filenames as possible.

Again, to which extent do we put on the standard to support cursed shenanigans at the expense of everybody else?
 


>  Provide way to decode/check inputs
- Yes but only Unicode encoding forms, making implementations provide an entire non-unicode transcoding and detection library is a lot.
 
I think the standard should also support transcoding from (to?) the narrow/wide encodings.
 
- Probably no need to include encoding or encoding scheme _detection_ (via heuristics), for example detecting if a string is UTF-16 little endian or big endian via frequency heuristics
- it would be nice to have a standard way to transcode between utf-8/16/32 in a way that isn't broken by design like codecvt is. It would be nice if such a mechanism also included transcoding between WTF-8 and potentially ill-formed UTF-16 (not sure what this should do when going to UTF-32). Such a mechanism should also include options to select if invalid code unit sequences produce replacement characters or errors (not in the WTF case, ofc).
- providing functions for transcoding to/from other Unicode encodings (UTF-1, UTF-7, UTF-EBCDIC, GB18030, BOCU-1, CESU-8, SCSU, etc) is probably not necessary for the standard, but the mechanism should probably be able to support them by basically just adding more functions.
- different schemes of the same form probably don't need to be supported, just using the "system" one is probably fine
- it's natural to want converting iterators.
- I think this is the goal of ztd.text, although I've not talked to JeanHeyd Meneide about his plans for that library.
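
As a small illustration of the "replacement characters or errors" option mentioned in the list above, here is a C++ sketch of a UTF-16 to UTF-32 conversion with a selectable error policy. The `on_error` enum and `utf16_to_utf32` function are hypothetical names for this example; they are not taken from ztd.text or any proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// What to do when an unpaired surrogate is encountered.
enum class on_error { replace, fail };

// Converts potentially ill-formed UTF-16 to UTF-32.
// With on_error::replace, each unpaired surrogate becomes U+FFFD.
// With on_error::fail, the first unpaired surrogate yields std::nullopt.
std::optional<std::u32string> utf16_to_utf32(const std::u16string& in,
                                             on_error policy) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        const bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            if (policy == on_error::fail) return std::nullopt;
            u = 0xFFFD;  // REPLACEMENT CHARACTER
        }
        out += u;
    }
    return out;
}
```

A WTF-8 style conversion would be a third mode in which unpaired surrogates are passed through rather than replaced or rejected, which is why the replace/error choice does not apply there.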

> Use Unicode output where available
While this is nice, it might not be a great idea when we need to munge / do a lot of "stuff" to get the Unicode output to work. Maybe it's better to just provide output interfaces that pass data through to the kernel without any modifications.

> Improve the specification of text functions to clearly state pre/post conditions
yes.
printf format string parsing in most libraries probably depends on "%" being invariant in all supported character sets (it's even invariant between ascii and ebcdic). I also don't think "%" (0x25) appears as a trailing byte in any common multi-byte encoding, unlike "{" and "}", not 100% sure there.
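
A tiny C++ sketch of the trailing-byte concern, assuming (to the best of my knowledge) that the Shift-JIS encoding of katakana ボ is the byte pair 0x83 0x7B, whose second byte has the same value as ASCII "{". The helper `count_byte` is just an illustrative stand-in for a byte-oriented format-string scanner.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts bytes equal to `target`, the way a byte-oriented parser would
// look for a control character without decoding the text first.
std::size_t count_byte(const std::string& bytes, char target) {
    std::size_t n = 0;
    for (char c : bytes) {
        if (c == target) ++n;
    }
    return n;
}
```

Scanning the two bytes `"\x83\x7B"` for `'{'` reports one match even though no brace was written, while scanning them for `'%'` reports none. The UTF-8 form of the same character, `"\xE3\x83\x9C"`, contains no byte below 0x80, so neither scan can produce a false match there.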

If we're going to have some global setting for what the encoding of strings and string_views is supposed to be we need something better than locale. Maybe ztd.text's basic_text strategy is right here, maybe not.

> Deprecate most of <locale>
I mean yes, it's not that bad to have "locales specified by string that may or may not contain various properties", but the troubles with encodings and with the properties that can't represent multi-byte characters are pretty annoying. Actually, for std::format I would not have minded if locale support was omitted entirely from the standard format specifiers, relying instead on user defined formatters. It does bother me a little that we keep adding new functionality that depends on <locale>, even knowing we'll be (hopefully) replacing it someday. Especially because the replacement probably won't have exactly the same set of locales, and will probably have different values for some locale related data (in particular ones where the current locale facet can't deal with multi-byte characters).

> Work with vendors to increase utf8 adoption where possible

Yes, although the real problem is Unicode adoption. GB18030 and UTF-16 don't really cause that many problems (although admittedly GB18030 is a much more annoying form for many algorithms than UTF-8, and maybe some standard library text functions will require that the string is in a self-synchronizing Unicode encoding form).


A final thought:

Parameterizing literally every program on some platform string type that's 8-bit on Unix and 16-bit on Windows _can_ actually work, given the correct API and transcoding facilities.
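
One existing example of such a platform-dependent type is `std::filesystem::path::value_type`, which is `char` on POSIX systems and `wchar_t` on Windows. A minimal sketch built on it, where the aliases `native_char` and `native_string` and the helper `join_native` are hypothetical names:

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// The platform's native path character type, as chosen by the implementation.
using native_char = std::filesystem::path::value_type;
using native_string = std::basic_string<native_char>;

// Joins a directory and a file name without any transcoding: the code units
// are passed through exactly as the platform handed them over.
native_string join_native(native_string dir, const native_string& name) {
    dir += std::filesystem::path::preferred_separator;
    dir += name;
    return dir;
}
```

Code written against `native_string` never has to guess an encoding; transcoding only happens at the edges where text in a known encoding is actually required.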

Received on 2021-07-29 17:33:03