sg16: Re: [SG16] Locales, Encodings and Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Fri, 24 Jan 2020 13:24:47 -0500

With basically all of the work with text encoding now being done in terms
of 10646 / Unicode, "character set" isn't used much any more, except where
it was standardized early, such as in MIME formats, HTML charset, and the
C++ and C standards. Where the term "character set" is used today it tends
to mean the subset of characters from 10646 that is in use, although
repertoire is the official term of art, and encoding specifies the
numerical values used for those characters.

auto loc = newlocale(LC_CTYPE_MASK, "", 0);
auto name = nl_langinfo_l(CODESET, loc);

would, I think, work at least on a modern-ish POSIX system for getting the
environmentally requested encoding scheme. This is probably better than
proposals to attempt to figure out what the output is expecting, which can
be terribly complicated given pipes, tees, remote shells, X, and so on, at
least in unix-y environments.
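
A slightly fuller sketch of the two calls above, assuming a POSIX.1-2008
system where <locale.h> declares newlocale and <langinfo.h> declares
nl_langinfo_l (some platforms want <xlocale.h> instead). The helper name
codeset_for is made up for illustration. It never touches the global
locale, so it avoids the setlocale thread-safety problem:

```cpp
#include <langinfo.h>   // nl_langinfo_l, CODESET
#include <locale.h>     // newlocale, freelocale, LC_CTYPE_MASK
#include <string>

// Hypothetical helper: returns the codeset name of the named locale.
// The default name "" asks for the locale requested by the environment
// (LC_ALL, LC_CTYPE, LANG). Returns an empty string if the locale
// cannot be created, e.g. because it is not installed.
std::string codeset_for(const char* locale_name = "") {
    locale_t loc = newlocale(LC_CTYPE_MASK, locale_name,
                             static_cast<locale_t>(0));
    if (loc == static_cast<locale_t>(0)) {
        return std::string();
    }
    // Copy the name before freeing the locale object; the returned
    // pointer is not guaranteed to stay valid afterwards.
    std::string name = nl_langinfo_l(CODESET, loc);
    freelocale(loc);
    return name;
}
```

Calling codeset_for() with no argument gives the environment's codeset;
codeset_for("C") gives whatever name the implementation uses for the "C"
locale, and that name differs between C libraries.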

For literal encodings, we want to be able to recover the encoding scheme
used for the translation unit, which often varies within a program. That
is also a source of errors, but it is what it is. It's neither the CODESET
of the "C" locale, nor of the "" locale. It would be nice to have a
standard name for the thing, which is currently bundled into "execution
character set".
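
To illustrate that the literal encoding is fixed per translation unit at
compile time and is independent of any runtime locale, here is a rough
sketch that guesses it by looking at the bytes the compiler produced for
one probe literal. The helper name guess_literal_encoding and the strings
it returns are made up for illustration, and a single probe character can
only tell a few encodings apart:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inspects the bytes of a literal that was written
// as "\u00E9" (U+00E9, e with acute) and guesses how it was encoded.
// N includes the terminating NUL.
template <std::size_t N>
std::string guess_literal_encoding(const char (&probe)[N]) {
    const unsigned char b0 = static_cast<unsigned char>(probe[0]);
    if (N == 3 && b0 == 0xC3 &&
        static_cast<unsigned char>(probe[1]) == 0xA9) {
        return "UTF-8";
    }
    if (N == 2 && b0 == 0xE9) {
        return "single-byte, Latin-1 compatible";
    }
    return "unknown";
}
```

Intended use is guess_literal_encoding("\u00E9") in each translation unit
of interest. The answer depends on how that unit was compiled (for
example GCC's -fexec-charset option), not on the "C" or "" locale.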

On Fri, Jan 24, 2020 at 12:27 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 24/01/2020 05.44, Tom Honermann wrote:
> > Back to your example. I think what should happen is that the program
> > should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
> > variables are set consistently with your xterm configurations, call
> > setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in
>
> setlocale(LC_ALL, "") is ugly because it's not thread-safe wrt.
> uses of the locale information. Can we get at the environment's
> LC_CTYPE information without such stupid side-effects? Hm...
> It seems newlocale(0xff, "", 0) or similar with nl_langinfo_l would do it
> (on POSIX). Corentin, you mentioned in the call you needed to set
> away a pointer before the standard-mandated setlocale(LC_ALL, "C") runs,
> to preserve the original encoding information. Would the path outlined
> above make that unnecessary?
>
> > terms of the environment configured locale, use char8_t and UTF-8 as an
> > internal encoding (along with fancy new text processing interfaces that
> > we have yet to design), and transcode using the fancy new interfaces
> > JeanHeyd is working on to the environment configured locale when
> > performing text based I/O. In short, use char assuming the environment
> > configured locale when working directly with I/O provided text, use
> > char8_t for internally maintained text, and transcode between them as
> > necessary.
>
> Funnily enough, my "setlocale" man page associates LC_CTYPE with
> "Character classification", but nl_langinfo actually says
> it's returning the character encoding.
>
> In general, I would find it less confusing if the description
> vocabulary would consider "locale" and "character set / encoding"
> as orthogonal and essentially unrelated. The "locale" as the
> set of cultural preferences for the expression of certain things
> reaches beyond computers; people in Germany have used the decimal
> comma (not the decimal point) long before computers existed.
> In contrast, we have been talking about character encoding only
> as long as we had computers, which want to express everything
> (even text) as numbers.
>
> The fact that on Unix/POSIX, the environment character set /
> encoding is conveyed via locale-related environment variables
> is just an implementation artifact and not really interesting
> for the C++-level discussion.
>
> Jens

Received on 2020-01-24 12:27:33