SG16

Subject: Re: A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding
From: Charlie Barto (Charles.Barto_at_[hidden])
Date: 2021-07-29 21:11:03


> And I'd be careful about standard facilities transporting anything other than UTF-8 in char8_t because that defeats its purpose.

Yeah, if we were to provide an entry point for Windows that transcoded into WTF-8, the parameters should really be char* or unsigned char*, not char8_t 😃

> WTF-8 offers no benefit whatsoever over the status quo: it's untrusted bags of bytes that the user has to check.

It would offer something on Windows: none of the current ways to get narrow strings into your application from our C runtime round-trip. Since the current behavior is implementation-defined, we could offer that feature just as well on our own; it's a QoI issue. Adding something to the standard would just ensure that, if someone opts in on Windows, the same code works (with no transcoding) on platforms that traffic in narrow (char) byte strings, without any preprocessor work. Right now you need to use wmain/WinMain/GetCommandLineW and some preprocessor. Maybe if the QoI issue is corrected, everyone on Windows can just opt in, write main(), and get things in WTF-8.
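
To make the round-trip point concrete, here is a minimal sketch of what transcoding between potentially ill-formed UTF-16 and WTF-8 could look like. The function names (append_wtf8, utf16_to_wtf8, wtf8_to_utf16) are made up for illustration and are not part of any runtime or proposal, and the decoder assumes its input was produced by the encoder (it does no validation):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point using the generalized UTF-8
// byte pattern. Unlike strict UTF-8, lone surrogates (U+D800..U+DFFF) are
// accepted, which is what makes this WTF-8.
inline void append_wtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical: potentially ill-formed UTF-16 -> WTF-8.
// Proper surrogate pairs become one 4-byte sequence; unpaired surrogates
// are kept as 3-byte sequences instead of being replaced or rejected.
inline std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (is_high_surrogate(in[i]) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_wtf8(out, cp);
    }
    return out;
}

// Hypothetical: WTF-8 -> UTF-16. Assumes well-formed WTF-8 (for example the
// output of utf16_to_wtf8); a truncated trailing sequence is dropped.
inline std::u16string wtf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        const int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        if (i + static_cast<std::size_t>(len) > in.size()) break;
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += static_cast<std::size_t>(len);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

The point of the sketch is only that a lone surrogate survives the trip in both directions, which a strict UTF-8 conversion cannot offer.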

> and I don't know why this thread started to focus so much on command line arguments

Sorry, I was initially replying to your suggestion of a new char8_t entry point and got a bit sidetracked.

> to turn argv/argc into globals so they can be accessed by methods that would, depending on what the user asks for, serve bytes, UTF-8,
> or something else.

Yeah, I was going to suggest this. We already allow int main() {} as an entry point, and both Linux and Windows actually store the command line parameters as globals. I'm not sure exactly how environment variables are stored on Windows, but on Linux they are stored as globals. A function returning pointers to these globals could even return a platform-specific type, with functions to explicitly request a certain format, enabling round trips. Such a function could also give you one block for both argv and envp instead of individual arguments (even Linux stores the arguments as one block and then calls strlen in a loop to figure out where the boundaries are).
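
As a rough illustration of the "one block" idea, the sketch below splits a contiguous block of NUL-terminated strings into views without copying or transcoding. The name split_argument_block is hypothetical, and the block is passed in explicitly here because how an implementation would expose it is exactly the open question:

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical helper: given one contiguous block of NUL-terminated strings
// (the layout described above for arguments), return a view of each string.
// The bytes are left untouched; choosing an encoding is left to the caller.
inline std::vector<std::string_view> split_argument_block(const char* block, std::size_t size) {
    std::vector<std::string_view> args;
    const char* p = block;
    const char* const end = block + size;
    while (p < end) {
        // Find the terminator of the current string, staying inside the block.
        const void* nul = std::memchr(p, '\0', static_cast<std::size_t>(end - p));
        const std::size_t n = nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - p)
                                  : static_cast<std::size_t>(end - p);
        args.emplace_back(p, n);
        p += n + 1;  // skip the string and its terminator
    }
    return args;
}
```

A platform-specific return type could wrap such a block and offer explicit conversions on request, so that code which only forwards the arguments never pays for transcoding.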

We might _only_ need the one function returning an implementation-specified code unit type; if you want some other encoding, you can use the standard transcoding functions that we'll presumably add eventually (this time correctly specified!).


Potentially unrelated/unhelpful thought:

If such functionality were added, it would be neat if the Linux implementations worked even when procfs is not mounted (such as when the process is init). Unless I'm mistaken, fetching the command line through /proc for your own process means making something like two system calls just to dereference a pointer into your own address space.

From: Corentin Jabot <corentinjabot_at_[hidden]>
Sent: Thursday, July 29, 2021 2:53 PM
To: SG16 <sg16_at_[hidden]>
Cc: Charlie Barto <Charles.Barto_at_[hidden]>; Thiago Macieira <thiago_at_[hidden]>
Subject: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding



On Thu, Jul 29, 2021 at 10:28 PM Thiago Macieira via SG16 <sg16_at_[hidden]> wrote:
On Thursday, 29 July 2021 13:07:04 PDT Charlie Barto wrote:
> > The problem is that the way you wrote, it makes it sound like WTF-8 can be
> > used to hold invalid file paths on POSIX systems and round-trip those to
> > UTF-16. That doesn't work. Therefore, any cross-platform content that
> > attempts to transcode to UTF-16 will have to deal with undecodable paths
> > anyway.
> I think this is true for WTF-8 on platforms where the parameters can be
> arbitrary byte strings, but I don't think it's true in general. I think
> there are probably transcoding algorithms that will take valid utf-8 to
> equivalent, valid utf-16, and the reverse while also round tripping for all
> invalid values. PEP-383 might be able to do this. We probably don't want to
> invent a new encoding and apply it at startup 😊.

I agree that in reality, the strings will most likely be UTF-8. Not 100%
certain, but we should approach 99.9%.

And we should be mindful of that. Designing for the 99.9% use cases is, at the very least, a good starting point.
WTF-8 offers no benefit whatsoever over the status quo: it's untrusted bags of bytes that the user has to check.
And I'd be careful about standard facilities transporting anything other than UTF-8 in char8_t because that defeats its purpose.

That being said (and I don't know why this thread started to focus so much on command line arguments), a solution might be
to turn argv/argc into globals so they can be accessed by methods that would, depending on what the user asks for, serve bytes, UTF-8,
or something else.
Having them as parameters of main forces us to make a choice for everyone, or to have different main signatures (which is all or nothing for all arguments).



I can look up the discussion in the Qt development mailing list a year or two
ago on the topic, but the summary of our conclusions was:
- the vast majority of Unix/POSIX systems are installed with UTF-8 by default
- all current graphical Unix/POSIX systems end up requiring UTF-8
- systems that haven't updated to UTF-8 aren't likely to get new applications
- situations where UTF-8 isn't enabled are likely misconfigurations

The last point is relevant and changes when going from Qt to "any purpose"
C++ applications. Qt applications are never system applications, so they only
start when the system has already been configured (for example, we also used
to require the Linux random number generator to work). So for us, printing a
warning that your system was misconfigured and then overriding to the expected
situation was an acceptable solution.

That may not be the case for "any purpose" C++, especially if we talk about
minimal environments found in containers and tiny embedded devices. Only
recently did glibc add built-in support for C.UTF-8, as opposed to requiring
that a locale be created and installed using localedef or packages. So there's
a high probability that those constrained systems will say "C.UTF-8" is not a
valid locale and will fall back to "C.ANSI_X3.4-1986".

And I hope this will be less and less true as time goes on: it is unlikely that people will look at using
C++23 on these systems before glibc is updated.
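
For reference, the "try UTF-8, then fall back" behaviour discussed above can be sketched with nothing but std::setlocale. The function name activate_utf8_or_fallback is hypothetical, and whether "C.UTF-8" is available depends entirely on the system's C library:

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: try to activate the C.UTF-8 locale and fall back to
// the plain "C" locale when the system does not provide it (as constrained
// systems with an older C library may not). Returns the name of the locale
// that ended up active, as reported by setlocale.
inline std::string activate_utf8_or_fallback() {
    if (const char* name = std::setlocale(LC_ALL, "C.UTF-8")) {
        return name;
    }
    // The "C" locale is always available, so this call does not fail.
    return std::setlocale(LC_ALL, "C");
}
```

An application in the Qt situation described above could print a warning when the fallback branch is taken; a system-level tool would more likely just carry on with bytes.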

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering



--
SG16 mailing list
SG16_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/sg16


SG16 list run by sg16-owner@lists.isocpp.org