sg16: Re: [SG16] Locales, Encodings and Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 11 Jan 2020 10:51:42 +0100

On Fri, 10 Jan 2020 at 22:07, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 08/01/2020 20.15, Corentin Jabot via SG16 wrote:
> > Hello
> > Here is a paper attempting to describe some of the issues with the
> > <locale> facilities.
> > I offer a few solutions to explore, but there is no denying it will be
> > an uphill battle to remedy some of these issues.
> >
> > My goal was mostly to have a document we can refer people to and have a
> > basis of conversation for ourselves.
> >
> > https://github.com/cor3ntin/CPPProposals/raw/master/P2020/P2020.pdf
>
> I think a key observation here is that locale and encoding
> need to get a divorce. And that probably means std::locale
> needs to die (in its present shape and form).
>

It is not clear to me that we can't keep the name, but otherwise yes.

>
> To me, it seems the feature set of the current C or C++
> localization facilities is so sub-par that essentially
> nobody uses them for anything serious. So, there
> is little motivation to keep them except as a deprecated
> thing.
>
> I've heard that ICU is quite comprehensive in feature coverage,
> so any future design should take that into account.
>

Yes, that has to be a design goal, notably because the CLDR data are quite
big, so being able to rely on the platform's ICU (which is shipped virtually
everywhere) has some benefits.

> Regarding encoding, here's a situation I'm not sure how to
> handle:
>
> Suppose I have an xterm on my desktop configured for UTF-8,
> and another xterm configured for (say) ISO 8859-1. I'm now
> running the same binary in both xterms. What should happen?
> It seems inefficient and possibly burdensome to support
> one of several runtime-chosen encodings at every step of my
> program, so the recommendation probably is to have a
> (statically chosen) program-internal encoding (likely UTF-8
> or UTF-32) plus conversion facilities that can convert to
> the environment's encoding.
>

My guess is that the expectation at the time was that you would recompile.
What happens currently is that you get mojibake (
https://en.wikipedia.org/wiki/Mojibake ).
What should probably happen is that, when doing I/O, you get implicit
transcoding to and from UTF-8.
We often talk about a Unicode sandwich, where everything is stored as UTF-8
and converted at the I/O boundary.
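
To make the sandwich idea concrete, here is a minimal sketch of the output
side for the ISO 8859-1 xterm case: the program keeps its text as UTF-8 and
transcodes only when writing. The name utf8_to_latin1 is hypothetical (it is
not an existing or proposed standard facility), and replacing everything that
Latin-1 cannot represent with '?' is just one possible policy.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, for illustration only: transcode UTF-8 text to
// ISO 8859-1 at the output boundary. Code points above U+00FF and
// malformed input are replaced with '?'.
std::string utf8_to_latin1(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {  // ASCII: copied through unchanged
            out.push_back(static_cast<char>(b0));
            ++i;
            continue;
        }
        // Sequence length announced by the lead byte (0 means invalid lead).
        const std::size_t len = (b0 >= 0xC2 && b0 <= 0xDF) ? 2
                              : (b0 >= 0xE0 && b0 <= 0xEF) ? 3
                              : (b0 >= 0xF0 && b0 <= 0xF4) ? 4
                                                           : 0;
        if (len == 0) {  // stray continuation byte or invalid lead byte
            out.push_back('?');
            ++i;
            continue;
        }
        // Count the bytes of this sequence that are actually present
        // (lead byte plus continuation bytes of the form 10xxxxxx).
        std::size_t n = 1;
        while (n < len && i + n < in.size() &&
               (static_cast<unsigned char>(in[i + n]) & 0xC0) == 0x80) {
            ++n;
        }
        if (n == len && len == 2 && b0 <= 0xC3) {
            // U+0080..U+00FF: the only multi-byte range Latin-1 can hold.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
        } else {
            out.push_back('?');  // not representable, or truncated sequence
        }
        i += n;
    }
    return out;
}
```

A program running in the ISO 8859-1 xterm would then write
utf8_to_latin1(text) instead of text, while the one in the UTF-8 xterm writes
its bytes unchanged. Choosing between the two at runtime still requires
asking the environment for its encoding, which this sketch leaves out.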

>
> Whatever we do here, the programmer should have the ability
> to opt-out of any locale support (beyond "C") and any
> encoding conversion to keep the program footprint small for
> situations where advanced locale/encoding fun is not needed.
>

Agreed.

>
> Jens
>

Received on 2020-01-11 03:54:26