On Fri, 10 Jan 2020 at 22:07, Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 08/01/2020 20.15, Corentin Jabot via SG16 wrote:
> Hello
> Here is a paper attempting to describe some of the issues with the <locale> facilities.
> I offer a few solutions to explore, but there is no denying it will be an uphill battle to remedy some of these issues.
>
> My goal was mostly to have a document we can refer people to and a basis for conversation among ourselves.
>
> https://github.com/cor3ntin/CPPProposals/raw/master/P2020/P2020.pdf

I think a key observation here is that locale and encoding
need to get a divorce.  And that probably means std::locale
needs to die (in its present shape and form).

It is not clear to me that we can't keep the name, but otherwise yes.
 

To me, it seems the feature set of the current C or C++
localization facilities is so sub-par that essentially
nobody uses them for anything serious.  So, there
is little motivation to keep them except as a deprecated
thing.

I've heard that ICU is quite comprehensive in feature coverage,
so any future design should take that into account.

Yes, that has to be a design goal, notably because the CLDR data are quite big, so being able to rely on the platform's ICU
(which is shipped virtually everywhere) has some benefits.


Regarding encoding, here's a situation I'm not sure how to
handle:

Suppose I have an xterm on my desktop configured for UTF-8,
and another xterm configured for (say) ISO 8859-1. I'm now
running the same binary in both xterms.  What should happen?
It seems inefficient and possibly burdensome to support
one of several runtime-chosen encodings at every step of my
program, so the recommendation probably is to have a
(statically chosen) program-internal encoding (likely UTF-8
or UTF-32) plus conversion facilities that can convert to
the environment's encoding.

My guess is that the expectation at the time was that you would recompile.
What happens currently is that you get mojibake ( https://en.wikipedia.org/wiki/Mojibake ).
What should probably happen is that when doing I/O you have implicit transcoding to UTF-8.
We often talk about a "Unicode sandwich", where everything is stored as UTF-8 and converted at the I/O boundary.
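
To make the sandwich idea concrete, here is a rough sketch of the output side, assuming the program keeps all text as UTF-8 and only transcodes when writing to the terminal. The names env_encoding, utf8_to_latin1 and encode_for_output are made up for illustration and are not part of any existing or proposed API. The '?' substitution policy is also just an assumption, and a real facility would query the environment rather than take an enum.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical: the encoding expected by the environment (e.g. the xterm).
enum class env_encoding { utf8, latin1 };

// Transcode UTF-8 to ISO 8859-1. Code points above U+00FF and malformed
// sequences are replaced with '?' (an assumed policy, chosen for brevity).
inline std::string utf8_to_latin1(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {  // ASCII is identical in both encodings
            out.push_back(static_cast<char>(b0));
            ++i;
            continue;
        }
        // U+0080..U+00FF are exactly the two-byte sequences led by 0xC2 or 0xC3.
        if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            if ((b1 & 0xC0) == 0x80) {
                out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
                i += 2;
                continue;
            }
        }
        // Not representable or malformed: emit one '?' and skip the
        // continuation bytes that belong to this sequence.
        out.push_back('?');
        ++i;
        while (i < in.size() &&
               (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
            ++i;
        }
    }
    return out;
}

// The I/O boundary: internal text is always UTF-8 and is converted only here.
inline std::string encode_for_output(std::string_view utf8, env_encoding env) {
    switch (env) {
        case env_encoding::latin1: return utf8_to_latin1(utf8);
        case env_encoding::utf8:   break;
    }
    return std::string(utf8);
}
```

With something like this, the same binary would write the bytes unchanged to the UTF-8 xterm and transcode for the ISO 8859-1 one, instead of producing mojibake. The rest of the program never needs to know which encoding the environment uses.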
 

Whatever we do here, the programmer should have the ability
to opt-out of any locale support (beyond "C") and any
encoding conversion to keep the program footprint small for
situations where advanced locale/encoding fun is not needed.

Agreed.

 

Jens