Date: Sun, 28 Apr 2019 17:25:28 -0400
On Sun, Apr 28, 2019 at 4:01 PM <keld_at_[hidden]> wrote:
> I believe there are a number of encodings in East Asia that will still be
> developed for for quite some time to come.
>
> major languages and toolkits and operating systems are still character set
> independent.
> some people believe that unicode has not won, and some people are not
> happy with
> the unicode consortium. why abandon a model that still delivers for all?
>
> keld
>
I think there's really only one thing that needs to be fixed, and that's
the POSIX and C locales. Right now, the specification requires them to use
a single-byte encoding with 256 characters (Chapter 6, Section 2, first
sentence: http://pubs.opengroup.org/onlinepubs/9699919799/).

This restriction is what has been destroying the ability to behave
properly, by default, with a large set of encodings deployed around the
world, including Unicode. I am currently spending time contacting people on
the C Standards Committee and trying to find the individuals responsible
for overseeing the POSIX standard. Requiring the locale to be a single-byte
encoding is not "character set independent": it means that only a small
fraction of encodings (ASCII, or similar) can possibly be the default C or
POSIX locale. Unicode (specifically, UTF-8) happens to work in C and C++
only because the defaults of many implementations simply pass
char/wchar_t/char16_t/char32_t data through their interfaces without
touching it. But the moment anyone uses facets or locales in any meaningful
manner, much of it falls over.
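
As a rough illustration of that last point (a sketch only, not taken from
either standard): the helper below, whose name
count_chars_in_current_locale is made up for this example, uses the
standard mbrtowc to count how many characters the current locale sees in a
byte string. Under the "C" locale, the two UTF-8 bytes of U+00E9 are
typically either rejected or counted as two separate characters; which one
you get depends on the implementation, so no specific output is claimed.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of any standard): counts the characters
 * that the *current* locale's encoding sees in the byte string s.
 * Returns (size_t)-1 if the bytes are invalid or incomplete in that
 * encoding. */
size_t count_chars_in_current_locale(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        /* Passing NULL as the first argument decodes without storing. */
        size_t n = mbrtowc(NULL, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2) {
            return (size_t)-1;         /* invalid or truncated sequence */
        }
        if (n == 0) {
            break;                     /* embedded null character */
        }
        s += n;
        remaining -= n;
        ++count;
    }
    return count;
}
```

Calling setlocale(LC_ALL, "C") and then
count_chars_in_current_locale("\xC3\xA9") shows the single-byte view of
the text; switching to a UTF-8 locale (the name varies by system, for
example "en_US.UTF-8", and it may not be installed) is what makes the same
bytes decode as one character.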

POSIX and C need to acknowledge that multibyte encodings are reasonable
defaults (not just recommended extensions, but plausible defaults). Until
then, no: the C standard does not deliver for all, and it actively harms
the development and growth of international text processing on large and
small hardware systems.
Received on 2019-04-28 23:25:40