Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 29 Jul 2021 13:28:12 -0700
On Thursday, 29 July 2021 13:07:04 PDT Charlie Barto wrote:
> > The problem is that the way you wrote, it makes it sound like WTF-8 can be
> > used to hold invalid file paths on POSIX systems and round-trip those to
> > UTF-16. That doesn't work. Therefore, any cross-platform content that
> > attempts to transcode to UTF-16 will have to deal with undecodeable paths
> > any way.
> I think this is true for WTF-8 on platforms where the parameters can be
> arbitrary byte strings, but I don't think it's true in general. I think
> there are probably transcoding algorithms that will take valid utf-8 to
> equivalent, valid utf-16, and the reverse while also round tripping for all
> invalid values. PEP-383 might be able to do this. We probably don't want to
> invent a new encoding and apply it at startup 😊.

I agree that in reality, the strings will most likely be UTF-8. Not 100%
certain, but we should approach 99.9%.

I can look up the discussion in the Qt development mailing list a year or two
ago on the topic, but the summary of our conclusions was:
- the vast majority of Unix/POSIX systems are installed with UTF-8 by default
- all current graphical Unix/POSIX systems end up requiring UTF-8
- systems that haven't updated to UTF-8 aren't likely to get new applications
- situations where UTF-8 isn't enabled are likely misconfigurations

The last point is relevant, and it changes when moving from Qt to "any purpose"
C++ applications. Qt applications are never system applications, so they only
start when the system has already been configured (for example, we also used
to require the Linux random number generator to work). So for us, printing a
warning that your system was misconfigured and then overriding to the expected
configuration was an acceptable solution.
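
A minimal sketch of that warn-and-override idea, for illustration only. This is
not Qt's actual code: the helper names is_utf8_codeset and ensure_utf8_locale
and the list of candidate locales are assumptions, and it relies on the POSIX
nl_langinfo(CODESET) call.

```cpp
#include <cctype>
#include <clocale>
#include <cstdio>
#include <initializer_list>
#include <langinfo.h>   // POSIX: nl_langinfo, CODESET
#include <string>

// Hypothetical helper: true if a codeset name denotes UTF-8.
// Spellings differ between systems ("UTF-8", "utf8", ...), so compare
// case-insensitively and ignore hyphens.
bool is_utf8_codeset(const char *codeset)
{
    std::string normalized;
    for (const char *p = codeset; *p; ++p) {
        if (*p != '-')
            normalized += static_cast<char>(
                std::tolower(static_cast<unsigned char>(*p)));
    }
    return normalized == "utf8";
}

// Hypothetical startup hook: adopt the user's locale, and if its codeset
// is not UTF-8, warn and try to override LC_CTYPE with a UTF-8 locale.
// Returns false if no UTF-8 locale could be selected.
bool ensure_utf8_locale()
{
    std::setlocale(LC_ALL, "");   // pick up LANG / LC_* from the environment
    if (is_utf8_codeset(nl_langinfo(CODESET)))
        return true;

    std::fprintf(stderr,
                 "warning: locale codeset is \"%s\", not UTF-8; "
                 "trying to override LC_CTYPE\n",
                 nl_langinfo(CODESET));

    // Candidate names are an assumption; neither is guaranteed to exist.
    for (const char *name : {"C.UTF-8", "en_US.UTF-8"}) {
        if (std::setlocale(LC_CTYPE, name) != nullptr &&
            is_utf8_codeset(nl_langinfo(CODESET)))
            return true;
    }
    return false;
}
```

An application would call ensure_utf8_locale() once at startup, before creating
any threads, and decide for itself what to do when it returns false.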

That may not be the case for "any purpose" C++, especially if we talk about
minimal environments found in containers and tiny embedded devices. Only
recently did glibc add built-in support for C.UTF-8, as opposed to requiring
that a locale be created and installed using localedef or packages. So there's
a high probability that those constrained systems will say "C.UTF-8" is not a
valid locale and will fall back to "C.ANSI_X3.4-1968".
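
To check whether a given system falls into that category, one can probe for the
locale without touching the global locale. The sketch below assumes a
glibc-based Linux system with the POSIX.1-2008 newlocale and nl_langinfo_l
functions (other platforms may need a different header); the helper names
codeset_of and pick_utf8_locale and the candidate list are made up for this
example.

```cpp
#include <initializer_list>
#include <langinfo.h>   // POSIX: nl_langinfo_l, CODESET
#include <locale.h>     // POSIX: newlocale, freelocale, locale_t
#include <string>

// Hypothetical helper: returns the codeset of the named locale, or an
// empty string if the locale is not available on this system.
std::string codeset_of(const char *locale_name)
{
    locale_t loc = newlocale(LC_CTYPE_MASK, locale_name,
                             static_cast<locale_t>(0));
    if (loc == static_cast<locale_t>(0))
        return {};                      // e.g. "C.UTF-8" is not installed
    std::string codeset = nl_langinfo_l(CODESET, loc);
    freelocale(loc);
    return codeset;
}

// Hypothetical helper: returns the first candidate whose codeset is
// reported as "UTF-8", or "C" if none is available. The "C" locale uses
// an ASCII codeset (for example "ANSI_X3.4-1968" on glibc).
std::string pick_utf8_locale()
{
    for (const char *name : {"C.UTF-8", "en_US.UTF-8"}) {
        if (codeset_of(name) == "UTF-8")
            return name;
    }
    return "C";
}
```

On a minimal container image without any installed locales and with an older
glibc, pick_utf8_locale() would be expected to return "C", which is the
fallback described above.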

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2021-07-29 15:28:18