On Sat, Jan 25, 2020, 10:27 Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 24/01/2020 21.57, Corentin Jabot wrote:
> Locales do have encoding requirements.

Well, to be more precise, they have requirements on the
character set ("must contain French accented characters"),
but not on the actual encoding.

Yes, but that implies a requirement on the encoding too!

If you have Unicode as the character set, a French locale
is happy regardless of whether the encoding is UTF-8 or
UTF-16 or UTF-32 or whatever.
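
To make that concrete, here is a minimal sketch (an illustration only, not proposed wording). It spells "février" in the three Unicode encoding forms and counts code units. `code_units` and `check_fevrier_lengths` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts code units in a null-terminated string.
// Templated so it accepts char (or char8_t in C++20), char16_t and char32_t.
template <typename CharT>
constexpr std::size_t code_units(const CharT* s) {
    std::size_t n = 0;
    while (s[n] != CharT{}) ++n;
    return n;
}

// "\u00E9" is U+00E9 (e with acute accent); using a universal-character-name
// keeps the example independent of the source file's own encoding.
inline void check_fevrier_lengths() {
    assert(code_units(u8"f\u00E9vrier") == 8);  // UTF-8: U+00E9 takes two code units
    assert(code_units(u"f\u00E9vrier") == 7);   // UTF-16: one code unit per character here
    assert(code_units(U"f\u00E9vrier") == 7);   // UTF-32: one code unit per code point
}
```

The same abstract string is representable in all three; only the code unit sequence differs. Call `check_fevrier_lengths()` from `main()` or a test harness to run the checks.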

> If you want to format a date, February is février in France which cannot be encoded in an en_US.ASCII locale.

Your last word here is what I would really like to see
eradicated from the discussion except when prefixed with
POSIX or so.  The (abstract) locale is "en_US", and that
locale would never produce "février"

I agree that we need better terminology. POSIX locale is less confusing indeed.

The Windows model (any model, really) has the same limitation.
Localization requires abstract characters that may not be available in some character sets.
As such, if a program or function localizes text encoded in a non-Unicode encoding, that encoding must be able to represent all the characters
that the localization may involve.
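
As a rough sketch of that requirement (an illustration under my own assumptions, not an existing API), the check below asks whether a localized string can be represented in 7-bit ASCII. `representable_in_ascii` and `check_month_names` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical helper: true if every code point of a null-terminated
// UTF-32 string lies in the 7-bit ASCII range (U+0000..U+007F).
constexpr bool representable_in_ascii(const char32_t* text) {
    for (; *text != U'\0'; ++text) {
        if (*text > 0x7F) return false;
    }
    return true;
}

inline void check_month_names() {
    // The English month name fits in ASCII.
    assert(representable_in_ascii(U"February"));
    // The French one contains U+00E9 and does not.
    assert(!representable_in_ascii(U"f\u00E9vrier"));
}
```

A real facility would need the same kind of test against the actual target encoding, which is harder than the ASCII case shown here.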

Changing the global locale from en_US to fr_FR requires changing the encoding too, unless the encoding is assumed to be a Unicode encoding.

But then again, you don't get to choose the encoding when doing I/O.
In the case of a console, the environment determines which encoding should be used.
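
On a POSIX system, one way to see what the environment picked is sketched below. It is POSIX-only (`<langinfo.h>` is not available on Windows), and `environment_codeset` is a hypothetical helper name. Note that it changes the global C locale as a side effect.

```cpp
#include <cassert>
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX only: nl_langinfo, CODESET

// Hypothetical helper: returns the codeset name selected by the
// environment (LANG / LC_* variables), for example "UTF-8".
inline std::string environment_codeset() {
    // Adopt the user's locale; if this fails, the "C" locale stays in effect.
    std::setlocale(LC_ALL, "");
    const char* name = nl_langinfo(CODESET);
    assert(name != nullptr);  // POSIX specifies a pointer to a string here
    return std::string(name);
}
```

The program only observes this value; it is the terminal and the user's settings that decide it.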

Which leaves us with a few choices:
  • Use UTF-8
  • Don't use locales that are not the system locales or are not representable in the environment encoding
Is the conclusion that the only workable solution is to require UTF-8 in future localization facilities?
Maybe.
But that puts requirements on the environment: "Just use UTF-8" is not currently workable on Windows 7, z/OS, etc.
(On Windows 10 we can actually force the program's environment to use UTF-8, which is a good solution for new programs.)


> That is why historically these things are related.
> It is also why character classification is related to locale despite these things being orthogonal.

I thought we wanted to fix historic accidents instead of
trying to preserve them?  <cctype> should be left alone
by SG16; if you want Unicode-style character classification,
use a facility designed for that.

If you read P2020, you know we agree (see also P1628). I was just noting that localization, character classification, and encoding are three different things.

Jens