On Thu, Jul 29, 2021 at 10:28 PM Thiago Macieira via SG16 <sg16@lists.isocpp.org> wrote:
On Thursday, 29 July 2021 13:07:04 PDT Charlie Barto wrote:
> > The problem is that the way you wrote, it makes it sound like WTF-8 can be
> > used to hold invalid file paths on POSIX systems and round-trip those to
> > UTF-16. That doesn't work. Therefore, any cross-platform content that
> > attempts to transcode to UTF-16 will have to deal with undecodable paths
> > anyway.
> I think this is true for WTF-8 on platforms where the parameters can be
> arbitrary byte strings, but I don't think it's true in general. I think
> there are probably transcoding algorithms that will take valid utf-8 to
> equivalent, valid utf-16, and the reverse while also round tripping for all
> invalid values. PEP-383 might be able to do this. We probably don't want to
> invent a new encoding and apply it at startup 😊.

I agree that in reality, the strings will most likely be UTF-8. Not 100%
certain, but we should approach 99.9%.

And we should be mindful of that. Designing for the 99.9% use cases is, at the very least, a good starting point.
WTF-8 offers no benefit whatsoever over the status quo: it's untrusted bags of bytes that the user has to check.
And I'd be careful about standard facilities transporting anything other than UTF-8 in char8_t, because that defeats its purpose.

That being said (and I don't know why this thread started to focus so much on command-line arguments), a solution might be
to turn argc/argv into globals so they can be accessed through functions that would, depending on what the user asks for, serve bytes, UTF-8,
or something else.
Having them as parameters of main forces us to make one choice for everyone, or to have different main signatures (which is all or nothing for all arguments).
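
To make the idea of globally accessible arguments concrete, here is a minimal sketch. Every name in it (program_args, init, bytes, utf8, is_valid_utf8) is hypothetical and chosen for illustration only; nothing like it exists in the standard or has been proposed in this thread. It assumes the arguments arrive as plain char strings, stores them once, and lets each caller choose between the raw bytes and a checked UTF-8 view on a per-argument basis.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Returns true if s is well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF), following the byte ranges in the Unicode Standard.
inline bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }    // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range of the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;        // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;        // reject UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;        // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
        } else {
            return false;                     // stray continuation or invalid lead byte
        }
        if (s.size() - i < len) return false; // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// Hypothetical process-wide argument store (illustration only).
class program_args {
public:
    // Capture the raw arguments once, e.g. at the top of main().
    static void init(int argc, char** argv) {
        storage().assign(argv, argv + argc);
    }

    static std::size_t size() { return storage().size(); }

    // Raw view: the bytes exactly as they were handed to the program.
    static std::string_view bytes(std::size_t i) { return storage().at(i); }

    // Checked view: the same bytes, but only if they are well-formed UTF-8.
    // A C++20 variant could return char8_t-based text here instead.
    static std::optional<std::string_view> utf8(std::size_t i) {
        const std::string_view raw = bytes(i);
        if (!is_valid_utf8(raw)) return std::nullopt;
        return raw;
    }

private:
    static std::vector<std::string>& storage() {
        static std::vector<std::string> args;  // lives for the whole program
        return args;
    }
};
```

With this shape, a program would call program_args::init(argc, argv) once, and code that only wants text could use program_args::utf8(i) and handle the empty result, while code that must open arbitrary file names could keep using program_args::bytes(i). The choice is made per argument and per call site rather than once in the signature of main.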

I can look up the discussion in the Qt development mailing list a year or two
ago on the topic, but the summary of our conclusions was:
- the vast majority of Unix/POSIX systems are installed with UTF-8 by default
- all current graphical Unix/POSIX systems end up requiring UTF-8
- systems that haven't updated to UTF-8 aren't likely to get new applications
- situations where UTF-8 isn't enabled are likely misconfigurations

The last point is relevant, and it changes when going from Qt to "any purpose"
C++ applications. Qt applications are never system applications, so they only
start when the system has already been configured (for example, we also used
to require the Linux random number generator to work). So for us, printing a
warning that your system was misconfigured and then overriding to the expected
situation was an acceptable solution.

That may not be the case for "any purpose" C++, especially if we talk about
minimal environments found in containers and tiny embedded devices. Only
recently did glibc add built-in support for C.UTF-8, as opposed to requiring
that a locale be created and installed using localedef or packages. So there's
a high probability that those constrained systems will say "C.UTF-8" is not a
valid locale and will fall back to "C.ANSI_X3.4-1968".
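
As a rough illustration of the fallback described above, the following sketch tries the environment's locale first and then C.UTF-8, and reports which codeset ended up in effect. It is not Qt's actual code; the function name select_utf8_locale_or_fallback is made up for this example, and it assumes a POSIX system that provides nl_langinfo in <langinfo.h>.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX: nl_langinfo, CODESET

// Hypothetical helper: tries to end up in a UTF-8 locale and returns the
// name of the codeset that is actually in effect afterwards.
std::string select_utf8_locale_or_fallback() {
    // Honour the environment first (LANG, LC_ALL, ...). If it names a locale
    // that is not installed, setlocale fails and we stay in plain "C".
    if (std::setlocale(LC_ALL, "") == nullptr) {
        std::setlocale(LC_ALL, "C");
    }

    std::string codeset = nl_langinfo(CODESET);
    if (codeset == "UTF-8") {
        return codeset;
    }

    // The environment is not UTF-8: try to override the character type
    // category with C.UTF-8. On minimal systems this locale may not exist,
    // in which case setlocale returns a null pointer and nothing changes.
    if (std::setlocale(LC_CTYPE, "C.UTF-8") != nullptr) {
        return nl_langinfo(CODESET);
    }

    // No UTF-8 locale available; report whatever the fallback codeset is.
    return codeset;
}
```

An application in Qt's position could print a warning whenever the first check fails, while a general-purpose program on a constrained system would still need a plan for the case where the returned codeset is not UTF-8.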

And I hope this will be less and less true as time goes on: it is unlikely that people will look at using
C++23 on these systems before glibc is updated.

Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering
