sg16: Re: [SG16] Locales, Encodings and Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 24 Jan 2020 21:57:08 +0100

On Fri, Jan 24, 2020, 18:27 Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 24/01/2020 05.44, Tom Honermann wrote:
> > Back to your example. I think what should happen is that the program
> > should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
> > variables are set consistently with your xterm configurations, call
> > setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in
>
> setlocale(LC_ALL, "") is ugly because it's not thread-safe wrt.
> uses of the locale information. Can we get at the environment's
> LC_CTYPE information without such stupid side-effects? Hm...
> It seems newlocale(0xff, "", 0) or similar with nl_langinfo_l would do it
> (on POSIX). Corentin, you mentioned in the call you needed to set
> away a pointer before the standard-mandated setlocale(LC_ALL, "C") runs,
> to preserve the original encoding information. Would the path outlined
> above make that unnecessary?
>

After the call I got that working:

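// POSIX only; needs <locale.h> and <langinfo.h>.
// Builds a locale object from the environment without touching the global locale.
locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
const char* name = nl_langinfo_l(CODESET, loc);  // codeset name, e.g. "UTF-8"
const int mib = details::find_encoding(name);    // name is only used while loc is alive
freelocale(loc);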

This can be done when the function is called, and the result can be cached.
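
As a rough sketch of the caching idea, assuming a POSIX system: the lookup can sit behind a function-local static, which C++11 initializes exactly once and in a thread-safe way. The names query_environment_codeset and environment_codeset are hypothetical, and the sketch returns the codeset name rather than a MIB number, since details::find_encoding is not shown here.

```cpp
#include <langinfo.h>
#include <locale.h>
#ifdef __APPLE__
#include <xlocale.h>  // newlocale/freelocale are declared here on macOS
#endif
#include <string>

// Hypothetical helper: asks the environment (LANG, LC_ALL, LC_CTYPE) for its
// codeset name without calling setlocale, so the global locale is untouched.
inline std::string query_environment_codeset() {
    locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
    if (loc == (locale_t)0) {
        return std::string();  // the environment names a locale that is unavailable
    }
    const char* name = nl_langinfo_l(CODESET, loc);
    std::string result = name ? name : "";  // copy before the locale is freed
    freelocale(loc);
    return result;
}

// Hypothetical cached accessor: the query runs on the first call only.
inline const std::string& environment_codeset() {
    static const std::string cached = query_environment_codeset();
    return cached;
}
```

Calling environment_codeset() at any point, even after the program has switched its global locale, would then return whatever the environment specified at the time of the first call.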

>
> > terms of the environment configured locale, use char8_t and UTF-8 as an
> > internal encoding (along with fancy new text processing interfaces that
> > we have yet to design), and transcode using the fancy new interfaces
> > JeanHeyd is working on to the environment configured locale when
> > performing text based I/O. In short, use char assuming the environment
> > configured locale when working directly with I/O provided text, use
> > char8_t for internally maintained text, and transcode between them as
> > necessary.
>
> Funnily enough, my "setlocale" man page associates LC_CTYPE with
> "Character classification", but nl_langinfo actually says
> it's returning the character encoding.
>
> In general, I would find it less confusing if the description
> vocabulary would consider "locale" and "character set / encoding"
> as orthogonal and essentially unrelated. The "locale" as the
> set of cultural preferences for the expression of certain things
> reaches beyond computers; people in Germany have used the decimal
> comma (not the decimal point) long before computers existed.
> In contrast, we have been talking about character encoding only
> as long as we had computers, which want to express everything
> (even text) as numbers.
>
> The fact that on Unix/POSIX, the environment character set /
> encoding is conveyed via locale-related environment variables
> is just an implementation artifact and not really interesting
> for the C++-level discussion.
>

I strongly agree, but it's more nuanced than that.
Locales do have encoding requirements.

If you want to format a date, February is "février" in French, which cannot
be encoded in an en_US.ASCII locale.

That is why these things are historically related.
It is also why character classification is related to locale despite these
things being orthogonal.
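
To make the date example concrete, here is a small sketch, again assuming POSIX. The helper names february_in and is_ascii are hypothetical, and the locale name "fr_FR.UTF-8" mentioned afterwards is an assumption that may not be installed on a given system.

```cpp
#include <langinfo.h>
#include <locale.h>
#ifdef __APPLE__
#include <xlocale.h>  // newlocale/freelocale are declared here on macOS
#endif
#include <string>

// Hypothetical helper: stores the full name of the second month for the
// given locale in `out`. Returns false if the locale is not available.
inline bool february_in(const char* locale_name, std::string& out) {
    locale_t loc = newlocale(LC_TIME_MASK | LC_CTYPE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0) {
        return false;
    }
    out = nl_langinfo_l(MON_2, loc);  // copy before the locale is freed
    freelocale(loc);
    return true;
}

// True if every byte of s is in the 7-bit ASCII range.
inline bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) {
            return false;
        }
    }
    return true;
}
```

With the "C" locale the month name is plain ASCII. On a system where "fr_FR.UTF-8" is installed, february_in("fr_FR.UTF-8", name) would be expected to produce a string for which is_ascii returns false, which is the encoding requirement described above.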

> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-01-24 14:59:55