With basically all text-encoding work now being done in terms of ISO/IEC 10646 / Unicode, "character set" isn't used much any more, except where it was standardized early, such as in MIME formats, the HTML charset attribute, and the C and C++ standards. Where the term "character set" is used today it tends to mean the subset of characters from 10646 that is in use, although "repertoire" is the official term of art; the encoding specifies the numerical values used for those characters.

auto loc = newlocale(LC_CTYPE_MASK, "", 0);

auto name = nl_langinfo_l(CODESET, loc);

At least on a modern-ish POSIX system, I think this would work for getting the encoding scheme requested by the environment. This is probably better than proposals that try to figure out what the output is expecting, which can be terribly complicated given pipes, tees, remote shells, X, and so on, at least in Unix-y environments.
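
Spelled out a bit more fully, the two lines above might look like the sketch below. This is only an illustration under some assumptions: the helper name environment_codeset is made up here, the headers are the ones glibc uses (some systems, macOS for example, may want <xlocale.h> as well), and error handling is reduced to returning an empty string. The point is that the global locale is never touched, so there is no setlocale side effect to worry about.

```cpp
#include <langinfo.h>  // nl_langinfo_l, CODESET
#include <locale.h>    // newlocale, freelocale, LC_CTYPE_MASK, locale_t
#include <string>

// Hypothetical helper (not from any standard or proposal): returns the
// codeset name requested by the environment (LC_ALL / LC_CTYPE / LANG),
// or an empty string if no locale object could be created.
std::string environment_codeset() {
    // "" asks for the environment's locale; only the LC_CTYPE category
    // is taken from it. The global locale is left unchanged.
    locale_t loc = newlocale(LC_CTYPE_MASK, "", (locale_t)0);
    if (loc == (locale_t)0) {
        return {};  // e.g. the environment names a locale that is not installed
    }
    // Copy the name before freeing the locale object, since the returned
    // pointer may refer to storage owned by it.
    std::string name = nl_langinfo_l(CODESET, loc);
    freelocale(loc);
    return name;
}
```

A caller would then do something like `auto enc = environment_codeset();` at any point in the program. What string comes back (for instance "UTF-8" versus some other spelling) depends on the implementation and the environment, so it would still need normalizing before being compared against anything.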

For literal encodings, we want to be able to recover the encoding scheme used for the translation unit, which often varies within a program. That variation is also a source of errors, but it is what it is. It's neither the CODESET of the "C" locale nor that of the "" locale. It would be nice to have a standard name for the thing, which is currently bundled into "execution character set".

On Fri, Jan 24, 2020 at 12:27 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

On 24/01/2020 05.44, Tom Honermann wrote:
> Back to your example. I think what should happen is that the program
> should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
> variables are set consistently with your xterm configurations, call
> setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in

setlocale(LC_ALL, "") is ugly because it's not thread-safe wrt.
uses of the locale information. Can we get at the environment's
LC_CTYPE information without such stupid side-effects? Hm...
It seems newlocale(0xff, "", 0) or similar with nl_langinfo_l would do it
(on POSIX). Corentin, you mentioned in the call you needed to set
away a pointer before the standard-mandated setlocale(LC_ALL, "C") runs,
to preserve the original encoding information. Would the path outlined
above make that unnecessary?

> terms of the environment configured locale, use char8_t and UTF-8 as an
> internal encoding (along with fancy new text processing interfaces that
> we have yet to design), and transcode using the fancy new interfaces
> JeanHeyd is working on to the environment configured locale when
> performing text based I/O. In short, use char assuming the environment
> configured locale when working directly with I/O provided text, use
> char8_t for internally maintained text, and transcode between them as
> necessary.

Funnily enough, my "setlocale" man page associates LC_CTYPE with
"Character classification", but nl_langinfo actually says
it's returning the character encoding.

In general, I would find it less confusing if the description
vocabulary would consider "locale" and "character set / encoding"
as orthogonal and essentially unrelated. The "locale" as the
set of cultural preferences for the expression of certain things
reaches beyond computers; people in Germany have used the decimal
comma (not the decimal point) long before computers existed.
In contrast, we have been talking about character encoding only
as long as we had computers, which want to express everything
(even text) as numbers.

The fact that on Unix/POSIX, the environment character set /
encoding is conveyed via locale-related environment variables
is just an implementation artifact and not really interesting
for the C++-level discussion.

Jens