sg16: Re: [SG16] Locales, Encodings and Unicode

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 24 Jan 2020 18:27:16 +0100

On 24/01/2020 05.44, Tom Honermann wrote:
> Back to your example. I think what should happen is that the program
> should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
> variables are set consistently with your xterm configurations, call
> setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in

setlocale(LC_ALL, "") is ugly because it's not thread-safe with respect
to concurrent uses of the locale information. Can we get at the
environment's LC_CTYPE information without such stupid side effects? Hm...
It seems newlocale(LC_ALL_MASK, "", 0) or similar, combined with
nl_langinfo_l, would do it (on POSIX). Corentin, you mentioned in the call
that you needed to stash away a pointer before the standard-mandated
setlocale(LC_ALL, "C") runs, to preserve the original encoding information.
Would the path outlined above make that unnecessary?
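
A minimal sketch of that path, assuming a POSIX.1-2008 system where
<locale.h> declares newlocale and <langinfo.h> declares nl_langinfo_l
(some platforms, such as macOS, want <xlocale.h> as well). The helper name
environment_codeset is hypothetical, and the sketch narrows the mask to
LC_CTYPE_MASK since only the encoding name is needed:

```cpp
#include <langinfo.h>   // nl_langinfo_l, CODESET
#include <locale.h>     // newlocale, freelocale, locale_t, LC_CTYPE_MASK
#include <string>

// Hypothetical helper: returns the codeset name that the environment
// (LC_ALL / LC_CTYPE / LANG) selects, without calling setlocale() and
// therefore without touching the process-wide global locale.
// Returns an empty string if the environment names an unavailable locale.
std::string environment_codeset() {
    // "" asks for the locale named by the environment variables;
    // a zero base means "build a fresh locale object".
    locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
    if (loc == (locale_t)0) {
        return std::string();
    }
    // Query the private locale object, not the global locale.
    const char* name = nl_langinfo_l(CODESET, loc);
    std::string result = name ? name : "";
    freelocale(loc);
    return result;
}
```

Because the locale object is private to the call, the global locale stays
at whatever it was (initially "C"), and there is nothing to save before
program startup.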

> terms of the environment configured locale, use char8_t and UTF-8 as an
> internal encoding (along with fancy new text processing interfaces that
> we have yet to design), and transcode using the fancy new interfaces
> JeanHeyd is working on to the environment configured locale when
> performing text based I/O. In short, use char assuming the environment
> configured locale when working directly with I/O provided text, use
> char8_t for internally maintained text, and transcode between them as
> necessary.

Funnily enough, my "setlocale" man page associates LC_CTYPE with
"Character classification", but the nl_langinfo man page actually
documents CODESET (LC_CTYPE) as returning the character encoding.

In general, I would find it less confusing if our descriptive
vocabulary treated "locale" and "character set / encoding"
as orthogonal and essentially unrelated. The "locale" as the
set of cultural preferences for the expression of certain things
reaches beyond computers; people in Germany have used the decimal
comma (not the decimal point) long before computers existed.
In contrast, we have been talking about character encoding only
for as long as we have had computers, which want to express
everything (even text) as numbers.
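
As a rough illustration of that orthogonality, POSIX already lets the two
be selected independently: the decimal separator comes from LC_NUMERIC and
the encoding name from LC_CTYPE. The sketch below assumes a POSIX.1-2008
system; TextConventions and query_conventions are hypothetical names, not
anything proposed in this thread.

```cpp
#include <langinfo.h>   // nl_langinfo_l, CODESET, RADIXCHAR
#include <locale.h>     // newlocale, freelocale, locale_t, LC_*_MASK
#include <string>

// Hypothetical aggregate keeping the two concepts apart.
struct TextConventions {
    std::string radix;    // decimal separator, a cultural preference (LC_NUMERIC)
    std::string codeset;  // character encoding name (LC_CTYPE)
};

// Hypothetical helper: take the number format from one named locale and
// the encoding from another. Both fields stay empty if either locale is
// not available on the system.
TextConventions query_conventions(const char* numeric_name,
                                  const char* ctype_name) {
    TextConventions out;
    locale_t base = newlocale(LC_CTYPE_MASK, ctype_name, (locale_t)0);
    if (base == (locale_t)0) {
        return out;
    }
    // On success newlocale consumes `base`; on failure `base` is still ours.
    locale_t both = newlocale(LC_NUMERIC_MASK, numeric_name, base);
    if (both == (locale_t)0) {
        freelocale(base);
        return out;
    }
    out.radix = nl_langinfo_l(RADIXCHAR, both);
    out.codeset = nl_langinfo_l(CODESET, both);
    freelocale(both);
    return out;
}
```

On a system that has a German locale installed, one would expect a call
such as query_conventions("de_DE.UTF-8", "C") to pair the decimal comma
with the "C" locale's encoding; the locale name here is only an example
and may not exist on a given machine.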

The fact that on Unix/POSIX, the environment character set /
encoding is conveyed via locale-related environment variables
is just an implementation artifact and not really interesting
for the C++-level discussion.

Jens

Received on 2020-01-24 11:29:55