sg16: Re: [SG16] Locales, Encodings and Unicode

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 23 Jan 2020 23:44:49 -0500

On 1/10/20 4:07 PM, Jens Maurer via SG16 wrote:
> On 08/01/2020 20.15, Corentin Jabot via SG16 wrote:
>> Hello
>> Here is a paper attempting to describe some of the issues with the <locale> facilities.
>> I offer a few solutions to explore, but there is no denying it will be an uphill battle to remedy some of these issues.
>>
>> My goal was mostly to have a document we can refer people to and have a basis of conversation for ourselves.
>>
>> https://github.com/cor3ntin/CPPProposals/raw/master/P2020/P2020.pdf
> I think a key observation here is that locale and encoding
> need to get a divorce. And that probably means std::locale
> needs to die (in its present shape and form).

Separating locale and encoding is possible with Unicode, but there are
still implementations in use that use non-Unicode encodings for
stdout/stdin, filenames, etc. Unless and until Unicode displaces those,
the standard will need to acknowledge the association.

Even if locale and encoding were separated, there is still the reality
that non-English speakers can't read English regardless of what encoding
is used. Programs targeted at the general population still have to
localize text and, I suspect, a well designed localization facility
could hide many encoding related details. I'm not sure that separating
locale and encoding would actually solve all that large a problem,
though it would definitely help.

>
> To me, it seems the feature set of the current C or C++
> localization facilities is so sub-par that essentially
> nobody uses them for anything serious. So, there
> is little motivation to keep them except as a deprecated
> thing.
I agree.
>
> I've heard that ICU is quite comprehensive in feature coverage,
> so any future design should take that into account.
Absolutely.
>
> Regarding encoding, here's a situation I'm not sure how to
> handle:
>
> Suppose I have an xterm on my desktop configured for UTF-8,
> and another xterm configured for (say) ISO 8859-1. I'm now
> running the same binary in both xterms. What should happen?

The problem is actually a little worse than this since, between your
xterm and the program, lies the cooperatively maintained locale
encoding. The encoding used by your xterm isn't discoverable by the
program except through some non-standard side channel (I believe there
are no terminfo or similar capabilities for either querying or setting
the terminal encoding). On Linux/UNIX systems it is reasonable to
assume the terminal encoding matches the environment configured locale
(e.g., as indicated by the LANG, LC_ALL, or LC_CTYPE environment
variables). On Windows, the situation is different because the console
encoding almost always differs from the environment configured locale
(though you can query/change the console encoding on Windows).

Regardless of what the environment configured locale is, the program
starts up with the "C" locale (a great example of getting the default
wrong). A call to setlocale(LC_ALL, "") will set the program to match
the environment configured locale, but the program could just as well
call setlocale(LC_ALL, "en_US.utf-8"), which suffices to convince the
program that the locale is what it wants it to be regardless of the
reality that exists outside the program.

Back to your example. I think what should happen is that the program
should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
variables are set consistently with your xterm configurations and call
setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in
terms of the environment configured locale. It should then use char8_t
and UTF-8 as an internal encoding (along with fancy new text processing
interfaces that we have yet to design), and transcode to and from the
environment configured locale, using the fancy new interfaces JeanHeyd
is working on, when performing text based I/O. In short, use char
assuming the environment configured locale when working directly with
I/O provided text, use char8_t for internally maintained text, and
transcode between them as necessary.

> It seems inefficient and possibly burdensome to support
> one of several runtime-chosen encodings at every step of my
> program, so the recommendation probably is to have a
> (statically chosen) program-internal encoding (likely UTF-8
> or UTF-32) plus conversion facilities that can convert to
> the environment's encoding.
Oh, look, you already said what I said, but you did it in fewer words :)
>
> Whatever we do here, the programmer should have the ability
> to opt-out of any locale support (beyond "C") and any
> encoding conversion to keep the program footprint small for
> situations where advanced locale/encoding fun is not needed.

I think that is good advice that we should strive to keep in mind; many
programs do not require interaction with the full gamut of human
cultural diversity.

Thank you for joining our telecon yesterday. You had some fresh
perspectives to share and I appreciated that.

Tom.

>
> Jens

Received on 2020-01-23 22:47:27