Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 30 Jul 2021 09:11:33 +0200
On Fri, Jul 30, 2021 at 12:33 AM Charlie Barto <Charles.Barto_at_[hidden]>
wrote:

> > Would that be strictly better than the status quo?
>
> Upon further reflection (aka: sleep) I think it would be at most as good
> as the status quo, and probably worse (discounting that if the encoding
> isn't actually utf-8 we probably shouldn't use char8_t). It would be worse
> since it would mandate the broken behavior (lossy transcoding) that our
> implementation currently does in all cases.
>
> > WTF-8 cannot be assumed to be valid and so some checking has to be done
> by the users, at which point the difference compared to giving them a char*
> and having them decode it is negligible.
>
> yes, and in any case if we wanted to ensure the parameters were actually
> utf-8 the runtime startup code would have to do that check. If users are
> checking, they can defer or omit validity checks in some cases. This can be
> important: to check that the string is actually well formed you need to
> _actually look_ at every single byte and then do a sequence of probably a
> few dozen instructions to decide if it's valid. Sometimes it's OK to just
> assume it _is_ valid if you don't do anything that actually requires the
> whole thing be valid. For example you may linearly search for delimiters
> then parse text between them, as long as you are careful about validating
> the text between the delimiters it doesn't matter if some other part of the
> string is bogus, and you never have to execute the instructions that would
> check those other bits of the string.
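
The deferred-validation idea above can be sketched as follows. This is a minimal illustration, not code from any CRT or standard library: `is_valid_utf8` and `nth_field_checked` are hypothetical names, and the delimiter is assumed to be a single ASCII byte. Only the selected field is validated; bytes outside it are scanned for the delimiter but never checked.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Returns true if `s` is well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;       // reject overlong encodings
            if (b0 == 0xED) hi = 0x9F;       // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;       // reject overlong encodings
            if (b0 == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
        } else {
            return false;                    // stray continuation or invalid lead
        }
        if (i + len > s.size()) return false;
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// Finds the n-th field (0-based) separated by `delim` and validates only
// that field. Returns nullopt if the field is missing or ill-formed.
std::optional<std::string_view> nth_field_checked(std::string_view line,
                                                  char delim, std::size_t n) {
    assert(static_cast<unsigned char>(delim) < 0x80);  // ASCII delimiter assumed
    std::size_t start = 0;
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) return std::nullopt;
        start = pos + 1;
    }
    const std::size_t end = line.find(delim, start);
    const std::string_view field = line.substr(
        start, end == std::string_view::npos ? std::string_view::npos : end - start);
    if (!is_valid_utf8(field)) return std::nullopt;
    return field;
}
```

With a line whose first and last fields are bogus, the middle field can still be extracted and trusted, and the validator never runs over the other two.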
>
> It is useful to have a function that converts potentially ill-formed
> utf-16 to wtf-8 and back on windows where UTF-16 encoding is the
> convention.
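
A conversion of that kind might look like the sketch below. It is an illustration only: `utf16_to_wtf8` and `wtf8_to_utf16` are hypothetical names, and the decoder assumes its input was produced by the encoder here (it does not validate arbitrary bytes). Paired surrogates become one 4-byte sequence; an unpaired surrogate is kept as a 3-byte sequence, which is what makes the round trip lossless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the generalized UTF-8 encoding of `cp`; surrogate values are allowed.
static void append_wtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Potentially ill-formed UTF-16 -> WTF-8.
std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // A proper surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_wtf8(out, cp);  // a lone surrogate is encoded as-is (3 bytes)
    }
    return out;
}

// WTF-8 -> potentially ill-formed UTF-16. Assumes input from utf16_to_wtf8.
std::u16string wtf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        assert(i + len <= in.size());
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```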
>
> On linux wtf-8 would actually be "reversible" if you establish a
> convention for aligning the arguments to 2 and then always convert the
> char* array from the kernel from potentially ill-formed utf-16 to wtf-8
> (cast to uint16_t* then convert), this would cause character values to
> differ from the custom, but is no less valid than any other conversion, and
> does round trip. Doing that would not be useful, because the custom is that
> if you want to give a linux program the parameter "A" you encode it as
> "0x4100", not as "0x0041'0000" (all big endian notation)
>
> > Does it follow from the observation that it is technically possible
> under some scenario to form paths with embedded nulls that it's something
> that should be supported here?
> it is possible to form paths with embedded nulls, but they aren't that
> useful. On windows it's _absolutely_ possible for the program argument
> string to contain embedded nulls, but modifying main to support those
> probably isn't a good use of time. (you can kinda pass such parameters to
> linux programs by escaping a null with another null, resulting in a zero
> length argument).
>
> > More seriously, it would be great to have that as an opt-in (independent
> of WG21)
> It's already an opt-in, the switch to opt in is to call
> SetConsoleCP(CP_UTF8) at the start of your program. Maybe a manifest option
> would be a good idea, but again, it's not clear how that should work for
> DLLs.
>
> > I am really not concerned about things that are paths or binary blobs
> here.
> I agree it's not a big deal for format's locale specifiers.
>
> > The issue I think we should try to resolve is what you called "mixed"
> encoding, when literals and runtime-encoding strings have different
> encodings.
>
> I think "mixed" encoding should mean "byte strings (std::string,
> std::string_view, char*, etc) where different subsequences are in different
> encodings". This is different from the encoding of literals perhaps not
> matching the encoding of strings returned from various runtime functions or
> the OS.
>
> yes, this is important in the context of file access functions and the
> entry point. I know I used the example of unpaired surrogates, which is a
> very pathological case. I like that example because it's easy to hold in
> your head and reason about, and if that works it's likely the more
> important cases work too.
>
> More important are folders with invalid utf-16 (made worse on windows by
> how SHGetKnownFolderPath works, and the lack of anything (documented) like
> openat). Even there std::filesystem is willing to die upon encountering
> such paths, and apparently that doesn't cause too many problems, even with
> home folder names.
>
> > sometimes the arguments will be only ascii... but in the end some users
> will have non ascii characters in both the format strings and the arguments
> and this should be possible.
> And the only possible way that can work in practice is if implementers
> make it easy for them to opt-in to UTF-8.
>
> The only actual requirement for characters in the format string is on the
> representation of control characters and character boundaries, which need
> to match the execution character set. If the execution character set is
> self-synchronizing then you can have arbitrary sequences of bytes, even
> nulls, in the format string and everything will work out just fine, as long
> as none of those byte sequences form a subsequence that is a valid control
> character. UTF-8 is very much not required and programs that use universal
> character names (or the literal character, if supported by the
> implementation) will work just fine in _any_ encoding that is actually a
> unicode encoding form. Further, if the user has non-ascii characters in the
> format string and the arguments, and the encodings don't match I think it's
> reasonable and desirable for the result to be a string with subsequences in
> different encodings. Using "format("{}{}{}", a, b, c)" as a shorthand for a
> + b + c is reasonable (and notably no characters from the format string end
> up in the output), as is using "format("{}/{}.{}", base, name, extension)"
> to form paths.
>
> Saying that the only way this stuff can work in practice is if you opt
> into UTF-8 is just incorrect. Both examples work totally fine under _any_
> character set. The model of format is string concatenation with some
> options, and it's totally valid to concatenate strings in different
> encodings everywhere that uses byte strings. For languages that assert as a
> precondition that strings are valid utf-8 (maybe in c++ with char8_t
> strings) they don't worry about it when concatenating, and don't support
> concatenating byte strings with utf-8-by-construction strings.
>

I disagree there. If the format string contains anything but the common
subset of all encodings (which precludes forward slashes), the resulting
string will not be usable.
And I don't see that limiting the format string to a subset of the basic
latin 1 block would be what users want.

But it's not just format strings, right?
a+b doesn't produce a meaningful result if a is a string constructed at
compile time and b isn't; neither does format("{}{}", "Hallå", argv[0]), etc.
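
To make the byte-level problem concrete, here is a minimal sketch. It assumes, purely for illustration, that the literal is UTF-8 and that the runtime string arrives as ISO-8859-1; `latin1_to_utf8`, `concat_bytes`, and `concat_as_utf8` are hypothetical names, not proposed API. Plain concatenation leaves a lone 0xE5 byte after the UTF-8 literal, so the result is not well-formed in either encoding; transcoding the runtime part first gives a single-encoding string.

```cpp
#include <cassert>
#include <string>

// Transcodes ISO-8859-1 to UTF-8. Each Latin-1 byte is the code point
// U+0000..U+00FF with the same value, so at most two UTF-8 bytes are needed.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}

// What a + b (or format("{}{}", a, b)) does: bytes are copied verbatim.
std::string concat_bytes(const std::string& literal_part,
                         const std::string& runtime_part) {
    return literal_part + runtime_part;
}

// Concatenation after bringing the runtime part into the literal encoding.
std::string concat_as_utf8(const std::string& utf8_literal,
                           const std::string& latin1_runtime) {
    return utf8_literal + latin1_to_utf8(latin1_runtime);
}
```

For a literal "Hallå" stored as UTF-8 and a runtime "å" stored as the single byte 0xE5, `concat_bytes` yields bytes ending in C3 A5 E5, while `concat_as_utf8` yields C3 A5 C3 A5.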


>
>
> From: Corentin Jabot <corentinjabot_at_[hidden]>
> Sent: Thursday, July 29, 2021 2:53 AM
> To: Charlie Barto <Charles.Barto_at_[hidden]>
> Cc: sg16_at_[hidden]; Tom Honermann <tom_at_[hidden]>
> Subject: Re: [SG16] A UTF-8 environment specification; an alternative to
> assuming UTF-8 based on choice of literal encoding
>
>
>
> On Thu, Jul 29, 2021 at 1:09 AM Charlie Barto <Charles.Barto_at_[hidden]>
> wrote:
> > These things have different natures on different platforms.
> > Bytes on posix, UTF-16 on Windows (or WTF-16, not sure)
>
> WTF-16 is not an encoding, and is not named in the WTF-8 spec document.
> Windows command line argument (only one!) and environment variables are
> sequences of shorts, theoretically in platform byte order, although I think
> windows has only ever really supported little endian machines. The WTF-8
> document calls this "potentially ill-formed UTF-16" if it's intended to be
> interpreted as UTF-16 text, and applications in standard C++ will
> _sometimes_ interpret the parameters and variables as UTF-16 and sometimes
> not. In some cases it may be interpreted as UCS-2 and sometimes as a
> sequence of narrow characters in some codepage zero extended to 16-bits. I
> have no idea if you can get in a situation where the shell/command
> processor will give the program a sequence of zero extended UTF-8 code
> units.
>
> I think the UCS-2/UTF-16 distinction is controlled by _UNICODE, at least
> for the transcoding into arguments.
>
> Windows does not distinguish between multiple command line arguments,
> programs just get one big block of text, not multiple arguments from the
> kernel. The CRT splits this block into multiple arguments before calling
> main() (or wmain()). This is a property of the NT kernel, not just windows
> (although the "subsystem" could always do some kind of quoting /
> splitting).
>
> because standard C++ programs always use main(), the parameters are
> _always_ interpreted as some kind of text, because the CRT will go through
> and split the command line into separate arguments depending on
> control/quoting characters such as '\', '"', and '''.
>
> It's not totally clear if the quotes are interpreted in the active
> codepage or as invariant (always 0x22). backslash is _always_ invariant in
> all windows codepages because it's the path separator. Someone should test
> this.
>
> Thanks, this was informative
>
>
>
> > int main(int argc, char8_t** args, char8_t** env)
>
> Yeah I think anything like this should be specified to be WTF-8, even on
> posix making them actual utf-8 would break file path arguments. With WTF-8
> you can round trip to the original sequence of potentially ill formed
> utf-16 code units.
>
> Would that be strictly better than the status quo?
> WTF-8 cannot be assumed to be valid and so some checking has to be done by
> the users, at which point the difference compared to giving them a char*
> and having them decode it is negligible.
> But there is another question here.
> Does it follow from the observation that it is technically possible under
> some scenario to form paths with embedded nulls that it's something that
> should be supported here?
> I am not saying we should deprive Windows users of this capability but if
> we care, they could keep using the existing entry points.
> Same is true in other cases. Native APIs can keep handling these things,
> it's not obvious to me that the standard should!
>
>
> I'm not 100% sure exactly how the crt transcodes things for parameters
> when you manifest for UTF-8. It _probably_ calls WideCharToMultiByte, which
> doesn't result in wtf-8, it will error or emit replacements if I remember
> correctly. It may be difficult to change this behavior for backward
> compatibility reasons.
>
> An alternative is for the standard to specify a signature for main() that
> always takes parameters in the platform "native" manner (kinda like
> _tmain). I don't know if this should include having argc always equal 2
> with all arguments as one block in argv[1] on windows.
>
>
> > But, it is true that on windows calling SetConsole{Output}CP(CP_UTF8)
> would solve the windows problem
>
> It's really hard for the standard to depend on this happening, we can't
> have standard functions demand it, because that would mean they'd either
> fail if it's unset (the default, and unlikely to change due to
> compatibility) or they'd have to take a process global lock to set the
> console codepage (not good!). We'd probably be pretty strongly opposed to
> standard features that would require us to say "/std:c++26 makes the
> default utf-8, set by the CRT on startup" because that adds a ton of
> friction to the upgrade process, especially for folks who want to use the
> feature from a dll built in such a c++26 mode (it would be pretty rude of a
> dll to change your console encoding when it loaded wouldn't it!)
>
> And this is why we can't have nice things!
> More seriously, it would be great to have that as an opt-in (independent
> of WG21)
>
>
> > create_file(u8"嘿") cannot work portably. This can be a runtime error, if
> we can detect the encoding of the filesystem, if any (which isn't actually
> always possible, but it can be faked well enough). I think one of the issues
> currently with paths is that there are no requirements that we feed valid
> utf to these functions
>
> This absolutely can work portably. For filesystems that store filenames in
> sequences of 16-bit shorts they would just widen to UTF-16 and use that.
> For filesystems that store stuff in sequences of bytes they can just write
> the bytes out. It's a mistake for create_file to ever try and transcode to
> something that doesn't have a mapping from every Unicode codepoint. It can
> even be portable for filesystems that store filenames in a way that's not
> 8-bit clean, (let's say as a sequence of 6-bit bytes). In that case they
> could do something like using NUL (or perhaps the path separator) to start
> a shift encoding for things that don't fit in their smaller bytes.
>
> The thing that's not portable is using create_file(u8"嘿") and expecting it
> to open an existing file that was created using some other mapping to the
> actual byte sequence stored. All this means is that to really be portable
> you need to provide at least one "create_file" overload that takes the
> actual native path type as determined by the kernel (since the kernel
> itself had better do all the conversions from its internal representation
> to the filesystem's representation in the same way all the time).
>
> > Function and file names encoded in the __FILE__ macro, the __func__
> predefined variable, and in std::source_location objects.
>
> Can't specify for __FILE__. While the part of the filename you write in an
> include directive needs to have a lossless conversion to the execution
> character set, the full path can be in any encoding, and actually it can be
> in _multiple_ different encodings, as long as the path separator is
> invariant in all of them.
>
> not sure about __func__
>
> > What C calls character functions have an expectation of text encoding
>
> Most have an expectation of just specific properties of the encoding. For
> example that the thing encoded as "0x0" is, in fact, a terminator. The
> expectations of each are different. It's fine if some STL functions require
> their parameters to be encoded as UTF-8. Some way for users to assert that
> their normal "string" is actually utf-8 would probably be required. It
> would be nice if u8string/char8_t and friends were appropriate for this
> task, but it's really inconvenient for different encodings to use
> different, essentially unrelated types.
>
> > Would Microsoft be willing to implement print as desired without the
> need for WG21 to write special wording for them?
>
> it depends on if it's implementable. If the committee adopts something
> like the proposed "transcode to UTF-8 but don't set the console CP" then we
> can do that, requiring us to set the console CP in std::print would be a
> problem for the reasons mentioned above.
>
> > Would Microsoft be willing to set the active code page to CP_UTF8 under
> C++23 mode by default?
>
> Probably not. For the reasons mentioned above (it's extremely unfriendly
> for DLLs, and will cause substantial friction for users upgrading to a new
> c++ standard version). For anyone distributing DLLs, either libraries or
> plugins, requiring the UTF-8 codepage for C++23 mode would essentially be
> telling them they could not upgrade until all their users upgraded/changed
> their character set. This is the case for both the active code page and the
> console code page (which are not the same).
>
> > Would they be willing to provide a linker flag to do that? Will users
> understand that flag?
>
> Adding a linker option that sets the manifest option to turn on CP_UTF8 as
> the active code page is a good idea (something similar to the existing
> #pragma comment(linker, "/MANIFESTDEPENDENCY") ), there are issues around
> how it propagates and what happens when building a DLL, however. Anyway,
> the standard can't really depend on such an option.
>
> > There are other platforms that have mismatch between literal encodings
> and what is used by character functions at runtime. What do we do there?
> Are these implementers interested in improving the situation?
>
> This is basically why I'm uncomfortable with adding more and more stuff
> that depends on the literal encoding, especially when dealing with things
> that are not literals. There are a lot of strings from the operating system
> that are just sequences of bytes that are usually strings, but really just
> binary data. When dealing with them you just need to use robust decoders
> and rely on the user; having the library transcode all over the place
> can make it much, much harder to write correct programs in the presence of
> such data. In particular I would, in general, like to be able to take a
> filename as part of a named parameter (like "--file=...") and have that
> parameter be able to represent any possible file on the filesystem and have
> my program be able to refer to it (open it, read from it, etc). The usual
> suspect is files on windows that have unpaired surrogates in their names.
> If you do transcoding it's very easy to make these files unopenable.
>
> I am really not concerned about things that are paths or binary blobs here.
> The issue I think we should try to resolve is what you called "mixed"
> encoding, when literals and runtime-encoding strings have different
> encodings.
> And like, it doesn't really matter in the context of the conversation we
> had yesterday what users are likely to do with format. Sometimes the format
> string will only contain ascii and the arguments not,
> sometimes the arguments will be only ascii... but in the end some users
> will have non ascii characters in both the format strings and the arguments
> and this should be possible.
> And the only possible way that can work in practice is if implementers
> make it easy for them to opt-in to UTF-8.
> Then we can teach them what is text and what is binary blobs.
>
>
> Windows filenames can actually contain embedded NUL characters as well, as
> far as the kernel is concerned. No filesystems that come with windows allow
> this, but a third-party filesystem might. Same with embedded forward
> slashes in filenames. One could write a filesystem driver/dokany driver
> that simply conjures up as many cursed filenames as possible.
>
> Again, to what extent should the standard support cursed
> shenanigans at the expense of everybody else?
>
>
>
> > Provide way to decode/check inputs
> - Yes but only Unicode encoding forms, making implementations provide an
> entire non-unicode transcoding and detection library is a lot.
>
> I think the standard should also support transcoding from (to?) the
> narrow/wide encodings.
>
> - Probably no need to include encoding or encoding scheme _detection_ (via
> heuristics), for example detecting if a string is UTF-16 little endian or
> big endian via frequency heuristics
> - it would be nice to have a standard way to transcode between utf-8/16/32
> in a way that isn't broken by design like codecvt is. It would be nice if
> such a mechanism also included transcoding between WTF-8 and potentially
> ill-formed UTF-16 (not sure what this should do when going to UTF-32). Such
> a mechanism should also include options to select if invalid code unit
> sequences produce replacement characters or errors (not in the WTF case,
> ofc).
> - providing functions for transcoding to/from other Unicode encodings
> (UTF-1, UTF-7, UTF-EBCDIC, GB18030, BOCU-1, CESU-8, SCSU, etc) is probably
> not necessary for the standard, but the mechanism should probably be able
> to support them by basically just adding more functions.
> - different schemes of the same form probably don't need to be supported,
> just using the "system" one is probably fine
> - it's natural to want converting iterators.
> - I think this is the goal of ztd.text, although I've not talked to
> JeanHeyd Meneide about his plans for that library.
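
The "replacement characters or errors" option in the list above could take a shape like the following. This is a rough sketch under stated assumptions, not the ztd.text interface or any proposed standard API: `on_error` and `utf8_to_utf32` are hypothetical names, and the replace policy here emits one U+FFFD per offending byte, which is simpler than the maximal-subpart practice some decoders follow.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

enum class on_error { replace, fail };

// Decodes UTF-8 into UTF-32. On ill-formed input, either substitutes U+FFFD
// (on_error::replace) or gives up and returns nullopt (on_error::fail).
std::optional<std::u32string> utf8_to_utf32(std::string_view in, on_error policy) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;                 // 0 means "invalid lead byte"
        char32_t cp = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if (b0 <= 0x7F) {
            len = 1; cp = b0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;       // no overlong forms
            if (b0 == 0xED) hi = 0x9F;       // no surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;       // no overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // nothing above U+10FFFF
        }
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (k == 1) ? (b >= lo && b <= hi) : (b >= 0x80 && b <= 0xBF);
            cp = (cp << 6) | (b & 0x3F);
        }
        if (ok) {
            out += cp;
            i += len;
        } else if (policy == on_error::fail) {
            return std::nullopt;
        } else {
            out += static_cast<char32_t>(0xFFFD);
            ++i;                             // resynchronize one byte later
        }
    }
    return out;
}
```

Making the policy a parameter keeps the decoding loop identical for both behaviours; further encodings would be added as more functions of the same shape.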
>
> > Use Unicode output where available
> While this is nice, it might not be a great idea when we need to munge /
> do a lot of "stuff" to get the Unicode output to work. Maybe it's better to
> just provide output interfaces that pass data through to the kernel without
> any modifications.
>
> > Improve the specification of text functions to clearly state pre/post
> conditions
> yes.
> printf format string parsing in most libraries probably depends on "%"
> being invariant in all supported character sets (it's even invariant
> between ascii and ebcdic). I also don't think "%" (0x25) appears as a
> trailing byte in any common multi-byte encoding, unlike "{" and "}", not
> 100% sure there.
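
For one concrete encoding the trailing-byte question can be checked mechanically. The sketch below uses Shift_JIS, whose two-byte characters have a second byte in 0x40..0x7E or 0x80..0xFC; `is_sjis_trail` and `count_byte` are hypothetical helper names. It only covers this one encoding and says nothing about others.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if `b` can be the second byte of a two-byte Shift_JIS character.
constexpr bool is_sjis_trail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// '%' (0x25) is below the trail range, so a raw byte scan for it is safe.
static_assert(!is_sjis_trail(0x25), "'%' never appears as a trail byte");
// '{' (0x7B) and '}' (0x7D) fall inside the trail range.
static_assert(is_sjis_trail(0x7B) && is_sjis_trail(0x7D),
              "braces can appear as trail bytes");

// A naive scan that counts raw occurrences of `c`, ignoring character
// boundaries, the way a simple format-string parser might.
std::size_t count_byte(std::string_view s, char c) {
    std::size_t n = 0;
    for (char ch : s)
        if (ch == c) ++n;
    return n;
}
```

Scanning the two-byte Shift_JIS sequence 0x83 0x7B for '{' reports one hit even though no brace character is present, while a scan for '%' reports none.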
>
> If we're going to have some global setting for what the encoding of
> strings and string_views is supposed to be we need something better than
> locale. Maybe ztd.text's basic_text strategy is right here, maybe not.
>
> > Deprecate most of <locale>
> I mean yes, it's not that bad to have "locales specified by string that
> may or may not contain various properties", but the troubles with encodings
> and with the properties that can't represent multi-byte characters are
> pretty annoying. Actually, for std::format I would not have minded if
> locale support was omitted entirely from the standard format specifiers,
> relying instead on user defined formatters. It does bother me a little that
> we keep adding new functionality that depends on <locale>, even knowing
> we'll be (hopefully) replacing it someday. Especially because the
> replacement probably won't have exactly the same set of locales, and will
> probably have different values for some locale related data (in particular
> ones where the current locale facet can't deal with multi-byte characters).
>
> > Work with vendors to increase utf8 adoption where possible
>
> Yes, although the real problem is Unicode adoption, GB18030 and UTF-16
> don't really cause that many problems (although admittedly GB18030 is a
> much more annoying form for many algorithms than UTF-8, and maybe some
> standard library text functions will require that the string is in a
> self-synchronizing Unicode encoding form).
>
>
> A final thought:
>
> parameterizing literally every program on some platform string type that's
> 8bit on unix and 16bit on windows _can_ actually work, given the correct
> API and transcoding facilities.
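
One way to picture that final thought is a platform-selected code unit type, in the spirit of `_tmain`/`TCHAR`. The names `native_char`, `native_string`, `NATIVE_TEXT`, and `join_path` below are hypothetical and only illustrate the idea; the function passes code units through without inspecting or transcoding them.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
using native_char = wchar_t;         // 16-bit code units on Windows
#define NATIVE_TEXT(s) L##s
#else
using native_char = char;            // bytes on POSIX-like systems
#define NATIVE_TEXT(s) s
#endif

using native_string = std::basic_string<native_char>;

// Joins a directory and a file name. Neither argument is decoded, so names
// that are not valid text in any encoding survive unchanged.
native_string join_path(const native_string& dir, const native_string& name) {
    native_string out = dir;
    out += NATIVE_TEXT('/');
    out += name;
    return out;
}
```

`std::filesystem::path::value_type` already follows the same pattern: it is `char` on POSIX systems and `wchar_t` on Windows.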
>

Received on 2021-07-30 02:11:49