sg16: Re: [SG16] Locales, Encodings and Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 25 Jan 2020 11:31:00 +0100

On Sat, Jan 25, 2020, 10:27 Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 24/01/2020 21.57, Corentin Jabot wrote:
> > Locales do have encoding requirements.
>
> Well, to be more precise, they have requirements on the
> character set ("must contain French accented characters"),
> but not on the actual encoding.
>

Yes, but that implies a requirement on the encoding too!

>
> If you have Unicode as the character set, a French locale
> is happy regardless of whether the encoding is UTF-8 or
> UTF-16 or UTF-32 or whatever.
>
> > If you want to format a date, February is février in France which cannot
> > be encoded in an en_US.ASCII locale.
>
> Your last word here is what I would really like to see
> eradicated from the discussion except when prefixed with
> POSIX or so. The (abstract) locale is "en_US", and that
> locale would never produce "février"

I agree that we need better terminology. POSIX locale is less confusing
indeed.

The Windows model (any model, really) has the same limitation:
localization requires abstract characters that may not be available
in some character sets.
As such, if a program or function localizes text encoded in a
non-Unicode encoding, that encoding is required to be able to represent
all the characters that the localization may involve.
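
As a rough illustration of that requirement (a sketch only; the function
name representable_in_ascii is made up here and the input is assumed to be
valid UTF-8), one can check whether a localized string would survive in an
ASCII-only environment:

```cpp
#include <string_view>

// Hypothetical helper: returns true if every byte of a UTF-8 string is in
// the ASCII range (0x00..0x7F), i.e. the text can be represented unchanged
// in an ASCII-only encoding.
bool representable_in_ascii(std::string_view utf8) {
    for (unsigned char byte : utf8) {
        if (byte > 0x7F) {
            return false;  // part of a multi-byte (non-ASCII) code point
        }
    }
    return true;
}
```

With this, "February" passes while "f\xC3\xA9vrier" (février in UTF-8) does
not, which is the mismatch described above.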

Changing the global locale from en_US to fr_FR requires changing the
encoding too, if the encoding is not assumed to be a Unicode encoding.

But then again, you don't get to choose the encoding when doing I/O.
In the case of a console, the environment determines what encoding should
be used.
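
On POSIX systems, for example, a program can only ask which encoding the
environment selected. A minimal sketch, assuming a POSIX platform
(nl_langinfo is not available on Windows); the function name
environment_codeset is made up for this example:

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX only

// Hypothetical helper: returns the codeset name of the locale selected by
// the environment (LANG, LC_ALL, LC_CTYPE).
// Side effect: changes the LC_CTYPE category of the global C locale.
std::string environment_codeset() {
    std::setlocale(LC_CTYPE, "");             // adopt the environment's locale
    const char* name = nl_langinfo(CODESET);  // e.g. "UTF-8"
    return name ? std::string(name) : std::string();
}
```

The program reads the answer; it does not get to pick it.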

Which leaves us with a few choices:

   - Use UTF-8
   - Don't use locales that are not the system locales or are not
   representable in the environment encoding

Is the conclusion that the only workable solution is to require
UTF-8 in future localization facilities?
Maybe.
But it puts requirements on the environment - "Just use UTF-8" is not
currently workable on Windows 7, z/OS, etc.
(On Windows 10 we can actually force the program's environment to use UTF-8
- which is a good solution for new programs.)

> > That is why historically these things are related.
> > It is also why character classification is related to locale despite
> > these things being orthogonal.
>
> I thought we wanted to fix historic accidents instead of
> trying to preserve them? <cctype> should be left alone
> by SG16; if you want Unicode-style character classification,
> use a facility designed for that.

If you read P2020 you know we agree (see also P1628) - I was just noting
that localization, character classification and encoding are three
different things.
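
A small sketch of how <cctype> ties the three together (the function name
is_alpha_in_c_locale is made up for this example): classification works on
single bytes and depends on the current C locale, so the byte 0xE9, which
is é in ISO 8859-1, is not a letter in the "C" locale, and the two-byte
UTF-8 form of é cannot be classified byte by byte at all.

```cpp
#include <cctype>
#include <clocale>

// Hypothetical helper: classifies one byte with std::isalpha under the
// "C" locale, where only A-Z and a-z count as letters.
// Side effect: sets the LC_CTYPE category of the global C locale to "C".
bool is_alpha_in_c_locale(unsigned char byte) {
    std::setlocale(LC_CTYPE, "C");
    return std::isalpha(byte) != 0;
}
```

Here is_alpha_in_c_locale('a') is true and is_alpha_in_c_locale(0xE9) is
false; the result for 0xE9 under any other locale depends on the encoding
that locale assumes.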

>
> Jens
>
>

Received on 2020-01-25 04:33:47