Date: Fri, 24 Jan 2020 13:25:12 -0500
On 1/24/20 12:27 PM, Jens Maurer wrote:
> On 24/01/2020 05.44, Tom Honermann wrote:
>> Back to your example. I think what should happen is that the program
>> should assume that the LANG, LC_ALL, and/or LC_CTYPE environment
>> variables are set consistently with your xterm configurations, call
>> setlocale(LC_ALL, "") so that char/wchar_t based interfaces work in
> setlocale(LC_ALL, "") is ugly because it's not thread-safe wrt.
> uses of the locale information.
Agreed. Calling this would have to be something the program does as
part of its early initialization.
However, if the program is going to process any incoming text using
standard library facilities without first transcoding it, then such a
call is needed to handle characters outside the basic execution
character set correctly (I'm deferring to that term as a proxy for the
subset of characters that are common to all supported locale character
sets).
> Can we get at the environment's
> LC_CTYPE information without such stupid side-effects? Hm...
> It seems newlocale(0xff, "", 0) or similar with nl_langinfo_l would do it
> (on POSIX). Corentin, you mentioned in the call you needed to set
> away a pointer before the standard-mandated setlocale(LC_ALL, "C") runs,
> to preserve the original encoding information. Would the path outlined
> above make that unnecessary?
I think the concern wasn't that the standard-mandated as-if call to set
the locale to "C" interferes; it is that any other program call to
setlocale changes what nl_langinfo reports. However, I think you are
right that newlocale can be used to retrieve the original locale (so
long as the process hasn't changed its LANG or LC_* environment
variables; and if it has, then the desired behavior is rather unclear
anyway).
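
For concreteness, here is a minimal sketch of the newlocale/nl_langinfo_l approach discussed above. It assumes a POSIX.1-2008 system (glibc-style headers), and it uses LC_CTYPE_MASK where the quoted text has the literal 0xff. The helper name codeset_of_locale is hypothetical and is not part of any proposal.

```cpp
#include <langinfo.h>  // nl_langinfo_l, CODESET (POSIX)
#include <locale.h>    // newlocale, freelocale, setlocale, LC_CTYPE_MASK
#include <optional>
#include <string>

// Hypothetical helper: report the codeset name of the locale called `name`
// without changing the global locale. An empty name ("") selects the locale
// configured by the LANG / LC_ALL / LC_CTYPE environment variables.
std::optional<std::string> codeset_of_locale(const char* name = "") {
    // Build a standalone locale object; setlocale() is never called here.
    locale_t loc = newlocale(LC_CTYPE_MASK, name, static_cast<locale_t>(0));
    if (loc == static_cast<locale_t>(0)) {
        return std::nullopt;  // e.g. the named locale is not installed
    }
    // Copy the result before freeing the locale object that owns it.
    std::string codeset = nl_langinfo_l(CODESET, loc);
    freelocale(loc);
    return codeset;
}
```

Calling codeset_of_locale() with no argument queries the environment-configured locale, and the result should not depend on whatever setlocale calls the program has made in the meantime, so long as the environment variables themselves are unchanged.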
>
>> terms of the environment configured locale, use char8_t and UTF-8 as an
>> internal encoding (along with fancy new text processing interfaces that
>> we have yet to design), and transcode using the fancy new interfaces
>> JeanHeyd is working on to the environment configured locale when
>> performing text based I/O. In short, use char assuming the environment
>> configured locale when working directly with I/O provided text, use
>> char8_t for internally maintained text, and transcode between them as
>> necessary.
> Funnily enough, my "setlocale" man page associates LC_CTYPE with
> "Character classification", but nl_langinfo actually says
> it's returning the character encoding.
Indeed. I assume someone made an "eh, close enough" decision a long
time ago.
>
> In general, I would find it less confusing if the description
> vocabulary would consider "locale" and "character set / encoding"
> as orthogonal and essentially unrelated. The "locale" as the
> set of cultural preferences for the expression of certain things
> reaches beyond computers; people in Germany have used the decimal
> comma (not the decimal point) long before computers existed.
> In contrast, we have been talking about character encoding only
> as long as we had computers, which want to express everything
> (even text) as numbers.
>
> The fact that on Unix/POSIX, the environment character set /
> encoding is conveyed via locale-related environment variables
> is just an implementation artifact and not really interesting
> for the C++-level discussion.
Agreed.
Tom.
>
> Jens
Received on 2020-01-24 12:27:47