Subject: Re: Locales, Encodings and Unicode
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-01-24 14:57:08
On Fri, Jan 24, 2020, 18:27 Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> On 24/01/2020 05.44, Tom Honermann wrote:
> > Back to your example. I think what should happen is that the program
> > should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
> > variables are set consistently with your xterm configurations, call
> > setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in
> setlocale(LC_ALL, "") is ugly because it's not thread-safe wrt.
> uses of the locale information. Can we get at the environment's
> LC_CTYPE information without such stupid side-effects? Hm...
> It seems newlocale(0xff, "", 0) or similar with nl_langinfo_l would do it
> (on POSIX). Corentin, you mentioned in the call you needed to set
> away a pointer before the standard-mandated setlocale(LC_ALL, "C") runs,
> to preserve the original encoding information. Would the path outlined
> above make that unnecessary?
After the call I got that working:
locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
const char* name = nl_langinfo_l(CODESET, loc);
const int mib = details::find_encoding(name);
This can be done when the function is called, and the result can then be cached.
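For anyone who wants to try this, here is a minimal self-contained sketch of the same idea, assuming a POSIX.1-2008 system. The name environment_codeset is hypothetical, and the lookup from the codeset name to a MIB number (details::find_encoding above) is left out. Some platforms may need an extra header such as <xlocale.h> for newlocale; that is an assumption to check locally.

```cpp
#include <langinfo.h>  // nl_langinfo_l, CODESET
#include <locale.h>    // newlocale, freelocale, LC_CTYPE_MASK
#include <string>

// Hypothetical helper: returns the codeset name implied by the environment
// (LC_ALL / LC_CTYPE / LANG) without calling setlocale(), so the global
// locale of the program is left untouched.
std::string environment_codeset() {
    // "" asks for the environment-configured locale, LC_CTYPE category only.
    locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
    if (loc == (locale_t)0) {
        return {};  // environment names a locale that is not available
    }
    const char* name = nl_langinfo_l(CODESET, loc);
    // Copy before freeing: the returned pointer may refer to locale data.
    std::string result = name ? name : "";
    freelocale(loc);
    return result;
}
```

The result is copied into a std::string so the locale object can be released right away; a caller could store it in a function-local static to get the caching mentioned above.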
> > terms of the environment configured locale, use char8_t and UTF-8 as an
> > internal encoding (along with fancy new text processing interfaces that
> > we have yet to design), and transcode using the fancy new interfaces
> > JeanHeyd is working on to the environment configured locale when
> > performing text based I/O. In short, use char assuming the environment
> > configured locale when working directly with I/O provided text, use
> > char8_t for internally maintained text, and transcode between them as
> > necessary.
> Funnily enough, my "setlocale" man page associates LC_CTYPE with
> "Character classification", but nl_langinfo actually says
> it's returning the character encoding.
> In general, I would find it less confusing if the description
> vocabulary would consider "locale" and "character set / encoding"
> as orthogonal and essentially unrelated. The "locale" as the
> set of cultural preferences for the expression of certain things
> reaches beyond computers; people in Germany have used the decimal
> comma (not the decimal point) long before computers existed.
> In contrast, we have been talking about character encoding only
> as long as we had computers, which want to express everything
> (even text) as numbers.
> The fact that on Unix/POSIX, the environment character set /
> encoding is conveyed via locale-related environment variables
> is just an implementation artifact and not really interesting
> for the C++-level discussion.
I strongly agree but it's a lot more nuanced.
Locales do have encoding requirements.
If you want to format a date, February is février in France which cannot be
encoded in an en_US.ASCII locale.
That is why historically these things are related.
It is also why character classification is related to locale despite these
things being orthogonal.
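To make the février example concrete, here is a small sketch assuming the month name is held as UTF-8 bytes. The helper is_ascii is a hypothetical name, not part of any proposal; it only checks whether every byte is in the 7-bit range, which is what an ASCII-only locale encoding could represent.

```cpp
#include <string_view>

// Hypothetical helper: true if every byte of the text is 7-bit ASCII,
// i.e. the text could be emitted unchanged under an ASCII-only encoding.
bool is_ascii(std::string_view text) {
    for (unsigned char byte : text) {
        if (byte > 0x7F) {
            return false;
        }
    }
    return true;
}

// "février" written with explicit UTF-8 bytes for the e-acute (0xC3 0xA9),
// so the example does not depend on the source file encoding.
constexpr std::string_view fevrier = "f\xC3\xA9vrier";
```

Here is_ascii("February") holds while is_ascii(fevrier) does not, so a French date formatted for an ASCII-only locale has to be transliterated, substituted, or rejected.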