Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 29 Jul 2021 23:52:48 +0200
On Thu, Jul 29, 2021 at 10:28 PM Thiago Macieira via SG16 <sg16_at_[hidden]> wrote:

> On Thursday, 29 July 2021 13:07:04 PDT Charlie Barto wrote:
> > > The problem is that the way you wrote, it makes it sound like WTF-8 can be
> > > used to hold invalid file paths on POSIX systems and round-trip those to
> > > UTF-16. That doesn't work. Therefore, any cross-platform content that
> > > attempts to transcode to UTF-16 will have to deal with undecodeable paths
> > > any way.
> > I think this is true for WTF-8 on platforms where the parameters can be
> > arbitrary byte strings, but I don't think it's true in general. I think
> > there are probably transcoding algorithms that will take valid utf-8 to
> > equivalent, valid utf-16, and the reverse while also round tripping for all
> > invalid values. PEP-383 might be able to do this. We probably don't want to
> > invent a new encoding and apply it at startup 😊.
> I agree that in reality, the strings will most likely be UTF-8. Not 100%
> certain, but we should approach 99.9%.

And we should be mindful of that. Designing for the 99.9% use cases is, at
the very least, a good starting point.
WTF-8 offers no benefit whatsoever over the status quo: it is still an
untrusted bag of bytes that the user has to check.
And I'd be careful about standard facilities transporting anything other
than UTF-8 in char8_t because that defeats its purpose.
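
For reference, the kind of round-tripping transcoder mentioned in the quoted text (in the spirit of PEP 383's "surrogateescape") can be sketched as follows. This is only an illustration under assumptions: the names `decode_one`, `bytes_to_utf16` and `utf16_to_bytes` are hypothetical, each undecodable byte is mapped to the lone surrogate U+DC00 + byte, and the result is therefore not valid UTF-16 whenever the input was not valid UTF-8. The consumer still has to check it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Decode one strictly valid UTF-8 sequence starting at s[i].
// Returns its length (1 to 4) and stores the code point, or returns 0 if invalid.
inline std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp) {
    assert(i < s.size());
    const unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (b < 0x80) { cp = b; return 1; }
    if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else return 0;
    if (i + len > s.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates and values above U+10FFFF.
    const char32_t min_cp = (len == 2) ? 0x80 : (len == 3) ? 0x800 : 0x10000;
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Bytes -> UTF-16 code units; every undecodable byte becomes U+DC80..U+DCFF.
inline std::u16string bytes_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        const std::size_t len = decode_one(in, i, cp);
        if (len == 0) {
            out.push_back(static_cast<char16_t>(0xDC00 + static_cast<unsigned char>(in[i])));
            ++i;
            continue;
        }
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
        i += len;
    }
    return out;
}

// Inverse mapping: escaped surrogates turn back into the original bytes.
// Other lone surrogates are not produced by bytes_to_utf16; this sketch
// simply encodes them as three bytes, where a stricter version would reject them.
inline std::string utf16_to_bytes(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC80 && cp <= 0xDCFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp - 0xDC00)));
            continue;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

}  // namespace sketch
```

Valid UTF-8 comes out as ordinary UTF-16, and arbitrary bytes survive a round trip, but only at the cost of carrying lone surrogates that every consumer has to be prepared for.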

That being said (and I don't know why this thread started to focus so
much on command-line arguments), a solution might be
to turn argv/argc into globals so they can be accessed by functions that
would, depending on what the user asks for, serve bytes, UTF-8,
or something else.
Having them as parameters of main forces us to make a choice for everyone -
or have different main signatures (which is all or nothing for all
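
As a rough illustration of the globals idea, here is a minimal sketch of process-wide argument storage with separate accessors for raw bytes and for validated UTF-8. Everything in it is an assumption made for the example: the names `capture_args`, `args_as_bytes` and `arg_as_utf8` are hypothetical and not taken from any proposal, and a real facility would presumably be filled in by the runtime before main rather than by an explicit call. It returns `std::string` to stay within C++17; a C++20 version could hand out `std::u8string` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical process-wide storage for the original argument bytes.
inline std::vector<std::string>& storage() {
    static std::vector<std::string> args;
    return args;
}

// Stand-in for what the runtime would do before main is entered.
inline void capture_args(int argc, const char* const* argv) {
    assert(argc >= 0);
    storage().assign(argv, argv + argc);
}

// Accessor 1: the untouched bytes, with no encoding promise at all.
inline const std::vector<std::string>& args_as_bytes() { return storage(); }

// Strict UTF-8 check: rejects overlong forms, surrogates and truncation.
inline bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80) { ++i; continue; }
        if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;
        if (i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        const char32_t min_cp = (len == 2) ? 0x80 : (len == 3) ? 0x800 : 0x10000;
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

// Accessor 2: the argument only if it is well-formed UTF-8, otherwise nothing.
inline std::optional<std::string> arg_as_utf8(std::size_t index) {
    if (index >= storage().size() || !is_valid_utf8(storage()[index])) {
        return std::nullopt;
    }
    return storage()[index];
}

}  // namespace sketch
```

With something of this shape, a program that wants bytes calls `args_as_bytes()`, one that wants text calls `arg_as_utf8(i)` and handles the empty case, and the signature of main does not have to pick one for everybody.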

> I can look up the discussion in the Qt development mailing list a year or two
> ago on the topic, but the summary of our conclusions was:
> - the vast majority of Unix/POSIX systems are installed with UTF-8 by default
> - all currently graphical Unix/POSIX systems end up requiring UTF-8
> - systems that haven't updated to UTF-8 aren't likely to get new applications
> - situations where UTF-8 isn't enabled are likely misconfigurations
> The last point is relevant and changes when compared from Qt to "any purpose"
> C++ applications. Qt applications are never system applications, so they only
> start when the system has already been configured (for example, we also used
> to require the Linux random number generator to work). So for us, printing a
> warning that your system was misconfigured and then overriding to the expected
> situation was an acceptable solution.
> That may not be the case for "any purpose" C++, especially if we talk about
> minimal environments found in containers and tiny embedded devices. Only
> recently did glibc add built-in support for C.UTF-8, as opposed to requiring
> that a locale be created and installed using localedef or packages. So there's
> a high probability that those constrained systems will say "C.UTF-8" is not a
> valid locale and will fall back to "C.ANSI_X3.4-1986".

And I hope this will be less and less true as time goes on: it is unlikely
that people will look at using
C++23 on these systems before glibc is updated.
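
The fallback situation described in the quoted text can be probed at startup. Below is a minimal sketch, assuming a glibc-style locale name of the form "name.codeset@modifier"; the string returned by `std::setlocale` is implementation-defined, so that parsing is an assumption. The helper names (`names_utf8`, `codeset_of`, `select_utf8_ctype`) and the candidate list are hypothetical.

```cpp
#include <clocale>
#include <cstddef>
#include <string>

namespace sketch {

// True if a codeset name such as "UTF-8" or "utf8" denotes UTF-8.
inline bool names_utf8(const std::string& codeset) {
    std::string norm;
    for (char c : codeset) {
        if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
        if (c != '-' && c != '_') norm.push_back(c);
    }
    return norm == "UTF8";
}

// Codeset part of "lang_TERRITORY.codeset@modifier", or "" if there is none.
inline std::string codeset_of(const std::string& locale_name) {
    const std::size_t dot = locale_name.find('.');
    if (dot == std::string::npos) return "";
    const std::size_t at = locale_name.find('@', dot);
    const std::size_t end = (at == std::string::npos) ? locale_name.size() : at;
    return locale_name.substr(dot + 1, end - dot - 1);
}

// Try to end up with a UTF-8 LC_CTYPE. Returns the locale name that was
// selected, or "" if the system offered none (LC_CTYPE is then left at "C").
inline std::string select_utf8_ctype() {
    // First honor the environment, if it already asks for UTF-8.
    if (const char* env = std::setlocale(LC_CTYPE, "")) {
        const std::string name = env;
        if (names_utf8(codeset_of(name))) return name;
    }
    // Otherwise try a few candidates; std::setlocale returns a null pointer
    // when the locale is not installed, as on the constrained systems above.
    for (const char* candidate : {"C.UTF-8", "en_US.UTF-8"}) {
        if (std::setlocale(LC_CTYPE, candidate) != nullptr) return candidate;
    }
    std::setlocale(LC_CTYPE, "C");
    return "";
}

}  // namespace sketch
```

An application could call `select_utf8_ctype()` once at startup and print a warning when it returns an empty string, which is roughly the "warn and override" approach described for Qt in the quoted text.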

> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel DPG Cloud Engineering
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-07-29 16:53:01