
Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 29 Jul 2021 11:53:05 +0200
On Thu, Jul 29, 2021 at 1:09 AM Charlie Barto <Charles.Barto_at_[hidden]>
wrote:

> > These things have different natures on different platforms.
> > Bytes on POSIX, UTF-16 on Windows (or WTF-16, not sure)
>
> WTF-16 is not an encoding, and is not named in the WTF-8 spec document.
> The Windows command line (there is only one argument!) and environment
> variables are sequences of shorts, theoretically in platform byte order,
> although I think Windows has only ever really supported little-endian
> machines. The WTF-8 document calls this "potentially ill-formed UTF-16" if
> it's intended to be interpreted as UTF-16 text, and applications in
> standard C++ will _sometimes_ interpret the parameters and variables as
> UTF-16 and sometimes not. In some cases they may be interpreted as UCS-2,
> and sometimes as a sequence of narrow characters in some codepage,
> zero-extended to 16 bits. I have no idea if you can get into a situation
> where the shell/command processor will give the program a sequence of
> zero-extended UTF-8 code units.
>
> I think the UCS-2/UTF-16 distinction is controlled by _UNICODE, at least
> for the transcoding into arguments.
>
> Windows does not distinguish between multiple command line arguments:
> programs get one big block of text from the kernel, not multiple
> arguments. The CRT splits this block into multiple arguments before
> calling main() (or wmain()). This is a property of the NT kernel, not just
> Windows (although the "subsystem" could always do some kind of quoting /
> splitting).
>
> Because standard C++ programs always use main(), the parameters are
> _always_ interpreted as some kind of text: the CRT will go through and
> split the command line into separate arguments depending on
> control/quoting characters such as '\', '"', and '''.
>
> It's not totally clear whether the quotes are interpreted in the active
> codepage or as invariant (always 0x22). Backslash is _always_ invariant in
> all Windows codepages because it's the path separator. Someone should test
> this.
>
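
To make the "one big block of text" point concrete, here is a minimal
sketch (illustrative only; assumes a Windows build). GetCommandLineW
returns the raw, unsplit command line exactly as the kernel delivered it,
and CommandLineToArgvW is one userland splitter with its own quoting rules;
the CRT's splitter is a separate, similar one.

    #include <windows.h>
    #include <shellapi.h>
    #include <cstdio>

    int wmain() {
        // The single undivided block the process actually receives.
        const wchar_t* raw = GetCommandLineW();

        // Splitting into argv is a userland policy decision, not the kernel's.
        int argc = 0;
        wchar_t** argv = CommandLineToArgvW(raw, &argc);
        for (int i = 0; i < argc; ++i)
            wprintf(L"argv[%d] = %ls\n", i, argv[i]);
        LocalFree(argv);
    }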

Thanks, this was informative.


>
>
> > int main(int argc, char8_t** args, char8_t** env)
>
> Yeah, I think anything like this should be specified to be WTF-8; even on
> POSIX, making them actual UTF-8 would break file path arguments. With
> WTF-8 you can round-trip to the original sequence of potentially
> ill-formed UTF-16 code units.
>
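
To sketch why WTF-8 round-trips (a generalized encoder, illustrative only;
the function name is invented): a valid surrogate pair becomes one
four-byte sequence, while a lone surrogate is encoded directly as a
three-byte sequence that plain UTF-8 forbids, so every potentially
ill-formed UTF-16 sequence maps to a distinct byte string and back.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <string_view>

    std::string to_wtf8(std::u16string_view in) {
        std::string out;
        auto put = [&](std::uint32_t cp) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {      // includes lone surrogates
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        };
        for (std::size_t i = 0; i < in.size(); ++i) {
            std::uint32_t u = in[i];
            if (0xD800 <= u && u <= 0xDBFF && i + 1 < in.size()
                && 0xDC00 <= in[i + 1] && in[i + 1] <= 0xDFFF) {
                // Valid pair: combine into one supplementary code point.
                put(0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00));
                ++i;
            } else {
                put(u);                     // BMP unit or lone surrogate
            }
        }
        return out;
    }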

Would that be strictly better than the status quo?
WTF-8 cannot be assumed to be valid, so some checking has to be done by
the users, at which point the difference compared to giving them a char*
and having them decode it is negligible.
But there is another question here.
Does it follow from the observation that it is technically possible under
some scenario to form paths with embedded nulls that this is something that
should be supported here?
I am not saying we should deprive Windows users of this capability, but if
we care, they could keep using the existing entry points.
The same is true in other cases. Native APIs can keep handling these things;
it's not obvious to me that the standard should!


> I'm not 100% sure exactly how the CRT transcodes things for parameters
> when you manifest for UTF-8. It _probably_ calls WideCharToMultiByte,
> which doesn't result in WTF-8; it will error or emit replacement
> characters if I remember correctly. It may be difficult to change this
> behavior for backward compatibility reasons.
>
> An alternative is for the standard to specify a signature for main() that
> always takes parameters in the platform "native" manner (kinda like
> _tmain). I don't know if this should include having argc always equal 2,
> with all arguments as one block in argv[1], on Windows.
>
>
> > But, it is true that on Windows calling SetConsole{Output}CP(CP_UTF8)
> would solve the Windows problem
>
> It's really hard for the standard to depend on this happening; we can't
> have standard functions demand it, because that would mean they'd either
> fail if it's unset (the default, and unlikely to change due to
> compatibility) or they'd have to take a process-global lock to set the
> console codepage (not good!). We'd probably be pretty strongly opposed to
> standard features that would require us to say "/std:c++26 makes the
> default UTF-8, set by the CRT on startup", because that adds a ton of
> friction to the upgrade process, especially for folks who want to use the
> feature from a DLL built in such a C++26 mode (it would be pretty rude of
> a DLL to change your console encoding when it loaded, wouldn't it!)
>

And this is why we can't have nice things!
More seriously, it would be great to have that as an opt-in (independent of
WG21).
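
The manual opt-in is available to applications today; a minimal sketch
(the save-and-restore discipline is exactly why a library or DLL doing this
behind your back would be rude):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // The console code page is global to the console, so save the old
        // values and restore them on the way out.
        UINT old_in  = GetConsoleCP();
        UINT old_out = GetConsoleOutputCP();
        SetConsoleCP(CP_UTF8);
        SetConsoleOutputCP(CP_UTF8);

        std::puts("\xE5\x98\xBF");  // the UTF-8 bytes for U+563F (嘿)

        SetConsoleOutputCP(old_out);
        SetConsoleCP(old_in);
    }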


>
> > create_file(u8"嘿") cannot work portably. This can be a runtime error if
> we can detect the encoding of the filesystem, if any (which isn't actually
> always possible, but it can be faked well enough). I think one of the
> issues currently with paths is that there is no requirement that we feed
> valid UTF to these functions.
>
> This absolutely can work portably. For filesystems that store filenames as
> sequences of 16-bit shorts, they would just widen to UTF-16 and use that.
> For filesystems that store stuff in sequences of bytes, they can just
> write the bytes out. It's a mistake for create_file to ever try to
> transcode to something that doesn't have a mapping from every Unicode
> codepoint. It can even be portable for filesystems that store filenames in
> a way that's not 8-bit clean (let's say as a sequence of 6-bit bytes). In
> that case they could do something like using NUL (or perhaps the path
> separator) to start a shift encoding for things that don't fit in their
> smaller bytes.
>
> The thing that's not portable is using create_file(u8"嘿") and expecting
> it to open an existing file that was created using some other mapping to
> the actual byte sequence stored. All this means is that, to really be
> portable, you need to provide at least one "create_file" overload that
> takes the actual native path type as determined by the kernel (since the
> kernel itself had better do all the conversions from its internal
> representation to the filesystem's representation in the same way all the
> time).
>
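
A sketch of the lossless mapping described above (create_file and its
signature are hypothetical, and handle/error checking is omitted): widen
UTF-8 to UTF-16 for the wide Windows API, and pass the bytes through
untouched on POSIX.

    #include <string>
    #include <string_view>

    #ifdef _WIN32
    #include <windows.h>

    // Widen UTF-8 to UTF-16 and call the wide native API; every Unicode
    // scalar value survives this mapping, so nothing is lost.
    void create_file(std::u8string_view name) {
        const char* p = reinterpret_cast<const char*>(name.data());
        int n = static_cast<int>(name.size());
        int len = MultiByteToWideChar(CP_UTF8, 0, p, n, nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, p, n, wide.data(), len);
        CreateFileW(wide.c_str(), GENERIC_WRITE, 0, nullptr,
                    CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
    }
    #else
    #include <fcntl.h>

    // POSIX filenames are just bytes: write the UTF-8 bytes out unchanged.
    void create_file(std::u8string_view name) {
        std::string bytes(reinterpret_cast<const char*>(name.data()),
                          name.size());
        open(bytes.c_str(), O_CREAT | O_WRONLY, 0644);
    }
    #endif
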
> > Function and file names encoded in the __FILE__ macro, the __func__
> predefined variable, and in std::source_location objects.
>
> Can't specify __FILE__: while the part of the filename you write in an
> include directive needs to have a lossless conversion to the execution
> character set, the full path can be in any encoding, and actually it can
> be in _multiple_ different encodings, as long as the path separator is
> invariant in all of them.
>
> Not sure about __func__.
>
> > What C calls character functions have an expectation of text encoding
>
> Most have an expectation of only specific properties of the encoding, for
> example that the thing encoded as 0x00 is, in fact, a terminator. The
> expectations of each are different. It's fine if some STL functions
> require their parameters to be encoded as UTF-8. Some way for users to
> assert that their normal "string" is actually UTF-8 would probably be
> required. It would be nice if u8string/char8_t and friends were
> appropriate for this task, but it's really inconvenient for different
> encodings to use different, essentially unrelated types.
>
> > Would Microsoft be willing to implement print as desired without the
> need for WG21 to write special wording for them?
>
> It depends on whether it's implementable. If the committee adopts
> something like the proposed "transcode to UTF-8 but don't set the console
> CP" then we can do that; requiring us to set the console CP in std::print
> would be a problem for the reasons mentioned above.
>
> > Would Microsoft be willing to set the active code page to CP_UTF8 under
> C++23 mode by default?
>
> Probably not, for the reasons mentioned above (it's extremely unfriendly
> to DLLs and will cause substantial friction for users upgrading to a new
> C++ standard version). For anyone distributing DLLs, either libraries or
> plugins, requiring the UTF-8 codepage for C++23 mode would essentially be
> telling them they could not upgrade until all their users upgraded/changed
> their character set. This is the case for both the active code page and
> the console code page (which are not the same).
>
> > Would they be willing to provide a linker flag to do that? Will users
> understand that flag?
>
> Adding a linker option that sets the manifest option to turn on CP_UTF8 as
> the active code page is a good idea (something similar to the existing
> #pragma comment(linker, "/MANIFESTDEPENDENCY")), but there are issues
> around how it propagates and what happens when building a DLL. Anyway, the
> standard can't really depend on such an option.
>
> > There are other platforms that have mismatch between literal encodings
> and what is used by character functions at runtime. What do we do there?
> Are these implementers interested in improving the situation?
>
> This is basically why I'm uncomfortable with adding more and more stuff
> that depends on the literal encoding, especially when dealing with things
> that are not literals. There are a lot of strings from the operating
> system that are just sequences of bytes, usually strings but really just
> binary data. When dealing with them you need to use robust decoders and
> rely on the user, and having the library transcode all over the place can
> make it much, much harder to write correct programs in the presence of
> such data. In particular I would, in general, like to be able to take a
> filename as part of a named parameter (like "--file=...") and have that
> parameter be able to represent any possible file on the filesystem, and
> have my program be able to refer to it (open it, read from it, etc.). The
> usual suspects are files on Windows that have unpaired surrogates in their
> names. If you do transcoding, it's very easy to make these files
> unopenable.
>

I am really not concerned about things that are paths or binary blobs here.
The issue I think we should try to resolve is what you called "mixed"
encodings, where literals and runtime-encoded strings have different
encodings.
And it doesn't really matter, in the context of the conversation we had
yesterday, what users are likely to do with format. Sometimes the format
string will only contain ASCII and the arguments not,
sometimes the arguments will be only ASCII... but in the end some users
will have non-ASCII characters in both the format strings and the
arguments, and this should be possible.
And the only possible way that can work in practice is if implementers make
it easy for them to opt in to UTF-8.
Then we can teach them what is text and what is a binary blob.


>
> Windows filenames can actually contain embedded NUL characters as far as
> the kernel is concerned; no filesystem that comes with Windows allows
> this, but a third-party filesystem might. Same with embedded forward
> slashes in filenames. One could write a filesystem driver/Dokany driver
> that simply conjures up as many cursed filenames as possible.


Again, to what extent should the standard support cursed shenanigans at the
expense of everybody else?


>
>
> > Provide way to decode/check inputs
> - Yes, but only Unicode encoding forms; making implementations provide an
> entire non-Unicode transcoding and detection library is a lot.


I think the standard should also support transcoding from (to?) the
narrow/wide encodings.


> - Probably no need to include encoding or encoding scheme _detection_
> (via heuristics), for example detecting whether a string is UTF-16
> little-endian or big-endian via frequency heuristics.
> - It would be nice to have a standard way to transcode between
> UTF-8/16/32 in a way that isn't broken by design like codecvt is. It would
> be nice if such a mechanism also included transcoding between WTF-8 and
> potentially ill-formed UTF-16 (not sure what this should do when going to
> UTF-32). Such a mechanism should also include options to select whether
> invalid code unit sequences produce replacement characters or errors (not
> in the WTF case, ofc); a sketch of that option appears after this list.
> - Providing functions for transcoding to/from other Unicode encodings
> (UTF-1, UTF-7, UTF-EBCDIC, GB18030, BOCU-1, CESU-8, SCSU, etc.) is
> probably not necessary for the standard, but the mechanism should probably
> be able to support them by basically just adding more functions.
> - Different schemes of the same form probably don't need to be supported;
> just using the "system" one is probably fine.
> - It's natural to want converting iterators.
> - I think this is the goal of ztd.text, although I've not talked to
> JeanHeyd Meneide about his plans for that library.
>
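
The sketch referred to above: a minimal UTF-8 to UTF-32 decoder with a
caller-selected policy for invalid sequences (all names invented for
illustration; not a proposed API).

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <string_view>

    enum class on_error { replace, fail };

    std::u32string decode_utf8(std::u8string_view in, on_error policy) {
        std::u32string out;
        for (std::size_t i = 0; i < in.size(); ) {
            std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
            std::uint32_t cp = 0;
            std::size_t len = 0;
            if      (b0 < 0x80)           { cp = b0;        len = 1; }
            else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
            else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
            else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
            std::size_t j = 1;                  // bytes consumed so far
            for (; len != 0 && j < len; ++j) {
                std::uint8_t b = i + j < in.size()
                    ? static_cast<std::uint8_t>(in[i + j]) : 0;
                if ((b & 0xC0) != 0x80) { len = 0; break; }  // truncated
                cp = (cp << 6) | (b & 0x3F);
            }
            // Reject overlong forms, surrogates, out-of-range code points.
            constexpr std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
            if (len != 0 && (cp < min_cp[len]
                             || (0xD800 <= cp && cp <= 0xDFFF)
                             || cp > 0x10FFFF))
                len = 0;
            if (len == 0) {                     // invalid sequence
                if (policy == on_error::fail)
                    throw std::runtime_error("invalid UTF-8");
                out += U'\uFFFD';               // REPLACEMENT CHARACTER
                i += j;                         // skip the offending bytes
            } else {
                out += static_cast<char32_t>(cp);
                i += len;
            }
        }
        return out;
    }
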
> > Use Unicode output where available
> While this is nice, it might not be a great idea when we need to munge /
> do a lot of "stuff" to get the Unicode output to work. Maybe it's better to
> just provide output interfaces that pass data through to the kernel without
> any modifications.
>
> > Improve the specification of text functions to clearly state pre/post
> conditions
> Yes.
> printf format string parsing in most libraries probably depends on "%"
> being invariant in all supported character sets (it's even invariant
> between ASCII and EBCDIC). I also don't think "%" (0x25) appears as a
> trailing byte in any common multi-byte encoding, unlike "{" and "}"; not
> 100% sure there.
>
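
A toy illustration of that trailing-byte hazard (byte values as I recall
the Shift-JIS tables): in Shift-JIS, U+FF0B FULLWIDTH PLUS SIGN is the two
bytes 0x81 0x7B, so its trail byte is ASCII '{'. A byte-wise scan for '{'
misfires on it, while a scan for '%' (0x25) cannot, because Shift-JIS trail
bytes never go below 0x40.

    #include <cstdio>
    #include <cstring>

    int main() {
        // One Shift-JIS character, two bytes: U+FF0B FULLWIDTH PLUS SIGN.
        const char sjis[] = "\x81\x7B";
        // The naive scan "finds" a '{' that is really the middle of a
        // multi-byte character; the same scan for '%' can never misfire.
        if (std::memchr(sjis, '{', 2))
            std::puts("found '{' inside a multi-byte character");
    }
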
> If we're going to have some global setting for what the encoding of
> strings and string_views is supposed to be, we need something better than
> locale. Maybe ztd.text's basic_text strategy is right here, maybe not.
>
> > Deprecate most of <locale>
> I mean yes. It's not that bad to have "locales specified by a string that
> may or may not contain various properties", but the troubles with
> encodings and with the properties that can't represent multi-byte
> characters are pretty annoying. Actually, for std::format I would not have
> minded if locale support had been omitted entirely from the standard
> format specifiers, relying instead on user-defined formatters. It does
> bother me a little that we keep adding new functionality that depends on
> <locale>, even knowing we'll (hopefully) be replacing it someday.
> Especially because the replacement probably won't have exactly the same
> set of locales, and will probably have different values for some
> locale-related data (in particular where the current locale facet can't
> deal with multi-byte characters).
>
> > Work with vendors to increase utf8 adoption where possible
>
> Yes, although the real problem is Unicode adoption; GB18030 and UTF-16
> don't really cause that many problems (although admittedly GB18030 is a
> much more annoying form for many algorithms than UTF-8, and maybe some
> standard library text functions will require that the string is in a
> self-synchronizing Unicode encoding form).
>
>
> A final thought:
>
> Parameterizing literally every program on some platform string type that's
> 8-bit on Unix and 16-bit on Windows _can_ actually work, given the correct
> API and transcoding facilities.
>
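
One shape that parameterization could take, as a purely illustrative sketch
(none of these names are proposed): a platform character alias that is
8-bit on Unix-likes and 16-bit on Windows, with transcoding pushed to the
boundary.

    #include <string>

    #ifdef _WIN32
    using platform_char = wchar_t;   // 16-bit code units on Windows
    #else
    using platform_char = char;     // bytes on POSIX
    #endif

    using platform_string = std::basic_string<platform_char>;

    // APIs that touch the OS would traffic in platform_string; conversions
    // to and from UTF-8 (or WTF-8, per the earlier discussion) would live
    // in dedicated transcoding functions rather than being implicit.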

Received on 2021-07-29 04:53:19