sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Mon, 6 May 2019 20:11:30 +0300

On Sat, Apr 27, 2019 at 6:41 PM <keld_at_[hidden]> wrote:
> On Sat, Apr 27, 2019 at 01:53:00PM +0000, Lyberta wrote:
> > Querying properties of scalar values.
>
> some of it is there , isalpha etc.

Despite taking an int argument, isalpha operates on an 8-bit code unit
and not on a Unicode scalar value.

isalpha is fundamentally broken: It's locale-dependent, so it's
unsuitable for (portably) parsing ASCII protocol text. It operates
only on 8-bit code units, so it's unsuitable for processing natural
language in cases where a character doesn't fit in one code unit.
(Though it's unclear to me what kind of natural language processing
tasks present a use case for querying for the alphabeticness of a
character generically without caring about subdividing further by
script.)

Additionally, on platforms where char is signed, isalpha is a UB trap:
Simply passing a char-typed value as the argument is UB when the most
significant bit is set.
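
To make the trap concrete, here is a minimal sketch (the helper names
is_ascii_alpha and locale_isalpha are hypothetical, not anything
proposed for the standard). The first helper is a locale-independent
check that assumes an ASCII-compatible execution character set, where
the letters are contiguous. The second shows the cast to unsigned char
that is needed if std::isalpha is called with a char-typed value.

```cpp
#include <cctype>

// Locale-independent check for ASCII letters only.
// Assumption: the execution character set is ASCII-compatible, so that
// 'a'..'z' and 'A'..'Z' are contiguous ranges (not true for EBCDIC).
constexpr bool is_ascii_alpha(char c) noexcept {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// If std::isalpha has to be used, convert to unsigned char first.
// Passing a negative char value (other than EOF) directly is undefined
// behavior; after the cast the argument is always in range.
// The result still depends on the current C locale.
inline bool locale_isalpha(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

Neither helper operates on Unicode scalar values; both still look at a
single 8-bit code unit.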

The best course of action for isalpha and similar functions would be
to formally deprecate them to give compilers a standard-backed reason
to warn about their use.

On Sun, Apr 28, 2019 at 11:01 PM <keld_at_[hidden]> wrote:
> On Sun, Apr 28, 2019 at 11:04:58AM +0300, Henri Sivonen wrote:
> > On Sat, Apr 27, 2019 at 2:59 PM <keld_at_[hidden]> wrote:
> > >
> > > well, I am much against leaving the principle of character set neutrality in c++,
> > > and I am working to enhance cheracter set features in a pan-character set way
> >
> > But why? Do you foresee a replacement for Unicode for which
> > non-commitment to Unicode needs to be kept alive? What value is there
> > from pretending, on principle, that Unicode didn't win with no
> > realistic avenue for getting replaced--especially when other
> > programming languages, major GUI toolkits, and the Web Platform have
> > committed to the model where all text is conceptually (and
> > implementation-wise internally) Unicode but may be interchanged in
> > legacy _encodings_?
>
> I believe there are a number of encodings in East Asia that there will still be
> developed for for quite some time.

Do you have a concrete and specific concern related to East Asian
_encodings_ that would be a significant problem in the model where
_new_ text processing features are provided for one (or more) Unicode
Encoding Forms, i.e. conceptually for a sequence of Unicode scalar
values, and non-UTF _encodings_ need to be converted to a Unicode
Encoding Form first?

> major languages and toolkits and operating systems are still character set independent.

What major languages (other than C and C++) do you have in mind? What
major toolkits and operating systems do you have in mind?

Major languages that only support Unicode as the single coded
character set (i.e. number assignment for abstract characters)
include:
Java
JavaScript
C#
Python 2
Python 3
Objective-C
Swift
Rust
Go

Major GUI toolkits for C or languages extended from C that only
support Unicode as the single coded character set (i.e. number
assignment for abstract characters) include:
Gtk
Qt
Cocoa
Win32

In the case of Win32, you can't do anything with the "A" APIs that
can't be explained by transcoding to Unicode followed by the use of
"W" APIs. (I haven't seen the source code, but I'm fairly confident
that that's also how the "A" functions are actually implemented.)

When all these, and more, are already committed to Unicode as the only
numbering scheme for abstract characters, it seems implausible for a
different scheme to take over.

> and some people are not happy with the unicode consortium.

When something has the complexity and scope of Unicode, it is to be
expected that someone is unhappy about something. What's relevant in
terms of what C++ should support is that there is no alternative
numbering of abstract characters and collection of associated
algorithms with a scope comparable to Unicode that could present a
serious alternative. Even if one contemplated a scenario of the Unicode
Consortium going into the weeds in such a way that some other entity
would need to take over, it seems implausible that the numbers
assigned to abstract characters up to that point wouldn't be adopted
by such other entity. In that sense, it does not seem useful to avoid
committing to Unicode as the only supported numbering scheme for
abstract characters.

> why abandon a model that still delivers for all?

Because abstractions that try not to commit to Unicode are not free.

On Mon, Apr 29, 2019 at 8:53 PM <keld_at_[hidden]> wrote:
> mojibake never hit me

I started this thread with a subject line that contained a non-ASCII
character. I observe that your email client was the one that messed up
that character in the subject line from your reply onward. :-)

On Mon, Apr 29, 2019 at 9:11 PM <keld_at_[hidden]> wrote:
> I think some unicode is not well designed, like ucs16.

UCS-2 has been replaced by UTF-16, and C++ would do well to prioritize
UTF-8 over UTF-16. Every long-lived standard has some regrettable
design decisions. That UCS-2 was a mistake is not a reason to reject
Unicode.

On Tue, Apr 30, 2019 at 3:12 PM <keld_at_[hidden]> wrote:
> still. the same binary linux kernel an glibc lib can work with a myriad of charsets,

I haven't examined all parts of glibc, but at least the iconv part
works by pivoting via Unicode scalar values. So at least that part of
glibc already internally implements the model of converting legacy
encodings to Unicode and Unicode to legacy encodings.
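
As an illustration of the pivot model only (this is not glibc's code,
and the function names decode_latin1, encode_utf8 and latin1_to_utf8
are hypothetical), a conversion from a legacy encoding to UTF-8 can be
sketched as a decode step to Unicode scalar values followed by an
encode step. ISO-8859-1 is used as the legacy encoding because each of
its bytes maps to the scalar value with the same number. The encoder
assumes its input contains only valid scalar values.

```cpp
#include <string>
#include <string_view>

// Decode step: legacy encoding -> Unicode scalar values.
// In ISO-8859-1, byte value N is the scalar value U+00NN.
std::u32string decode_latin1(std::string_view bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (char b : bytes) {
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(b)));
    }
    return out;
}

// Encode step: Unicode scalar values -> UTF-8.
// Assumption: every element is a valid scalar value (no surrogates,
// nothing above U+10FFFF); no validation is done here.
std::string encode_utf8(std::u32string_view scalars) {
    std::string out;
    auto put = [&out](char32_t v) {
        out.push_back(static_cast<char>(static_cast<unsigned char>(v)));
    };
    for (char32_t c : scalars) {
        if (c < 0x80) {
            put(c);
        } else if (c < 0x800) {
            put(0xC0 | (c >> 6));
            put(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            put(0xE0 | (c >> 12));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        } else {
            put(0xF0 | (c >> 18));
            put(0x80 | ((c >> 12) & 0x3F));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// The pivot: any supported source encoding can reach any supported
// target encoding through the scalar-value representation.
std::string latin1_to_utf8(std::string_view bytes) {
    return encode_utf8(decode_latin1(bytes));
}
```

With N encodings, this needs N decoders and N encoders instead of a
converter for each of the N*N pairs.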

On Wed, May 1, 2019 at 10:41 AM JeanHeyd Meneide
<phdofthehouse_at_[hidden]> wrote:
> [ On the POSIX locale and the C locale ]
> The POSIX locale is the only fundamentally broken default (it mandates a single-byte encoding) in a multibyte world. The C locale is much more permissive and therefore gives us much more wiggle room. This is why it would make sense to work with the C and C++ Committees to jointly move towards a default C.UTF8 locale as the default locale rather than just C (a lot of implementations pick "C" to just mean POSIX).

Indeed, making C.UTF-8 the default would make a lot of sense. Thank
you for working on this.

> Moving to C.UTF8 as a mandated default with adherence to the wisdom other languages found in porting over (see: https://www.python.org/dev/peps/pep-0538/) might help us in turning things around. While <iostream>s is broken, many of the actual converting C functions are safe because of std::mbstate_t and the way they work (with DR 488 and friends applied: many thanks to Philipp Krause). I still have to check if all the single-conversion functions also have C standard "s" versions of the mbr/w/8/16/32-conversion functions equivalents for doing multi code point processing, so that implementations can opt into their shiny SIMD processing.

When decoding input from an I/O source, it is necessary to have a
decoder object that can encapsulate conversion state such that the
conversion can be interrupted at arbitrary points. However, when
converting a contiguous in-RAM string from one Unicode Encoding Form to
another, it's worthwhile to consider a conversion function with the
semantics of TextEncoder.encodeInto() from the Encoding Standard to
avoid having to allocate for the worst case when converting from a
UTF-16 string view into a UTF-8 string:
https://encoding.spec.whatwg.org/#dom-textencoder-encodeinto

This function reads incoming potentially invalid UTF-16 as if unpaired
surrogates had been replaced with U+FFFD (this is what USVString means
in WebIDL) and converts as many complete Unicode scalar values into
UTF-8 as can fit into the output buffer. It returns the number of
UTF-16 code units read and the number of UTF-8 code units written.
(The main use case for TextEncoder.encodeInto() is converting
JavaScript strings into UTF-8 strings residing inside the Wasm heap.)
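
A scalar (non-SIMD) sketch of a function with these semantics might
look like the following. The names EncodeResult and encode_into and
the exact signature are hypothetical; this is neither the Encoding
Standard's algorithm text nor Firefox's implementation, only an
illustration of the read/written contract described above.

```cpp
#include <cstddef>
#include <string_view>

struct EncodeResult {
    std::size_t read;     // UTF-16 code units consumed from the input
    std::size_t written;  // UTF-8 code units stored in the output
};

// Converts as many complete scalar values as fit into dst.
// Unpaired surrogates are treated as U+FFFD.
EncodeResult encode_into(std::u16string_view src, char* dst,
                         std::size_t dst_len) {
    std::size_t r = 0;
    std::size_t w = 0;
    auto put = [&](char32_t v) {
        dst[w++] = static_cast<char>(static_cast<unsigned char>(v));
    };
    while (r < src.size()) {
        char32_t c = src[r];
        std::size_t units = 1;
        if (c >= 0xD800 && c <= 0xDBFF && r + 1 < src.size() &&
            src[r + 1] >= 0xDC00 && src[r + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one scalar value.
            c = 0x10000 + ((c - 0xD800) << 10) + (src[r + 1] - 0xDC00);
            units = 2;
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            c = 0xFFFD;  // unpaired surrogate
        }
        const std::size_t need =
            c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (dst_len - w < need) {
            break;  // never write a partial UTF-8 sequence
        }
        if (need == 1) {
            put(c);
        } else if (need == 2) {
            put(0xC0 | (c >> 6));
            put(0x80 | (c & 0x3F));
        } else if (need == 3) {
            put(0xE0 | (c >> 12));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        } else {
            put(0xF0 | (c >> 18));
            put(0x80 | ((c >> 12) & 0x3F));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        }
        r += units;
    }
    return {r, w};
}
```

A caller that gets back read < src.size() can grow its output buffer
and call again with src.substr(read), so no worst-case allocation
(three UTF-8 code units per UTF-16 code unit) is needed upfront.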

Firefox's internal non-std::string C++ string implementation
internally uses a SIMD-accelerated function with the above semantics when
converting a UTF-16 span into a UTF-8 string (in a way that doesn't
allocate for the worst case upfront). By "internally" I mean that the
user of the string library sees an API whose input is a UTF-16 span
and output is an owning UTF-8 string, so the incrementalism isn't
exposed to the user of the string library. (UTF-8 to UTF-16 conversion
allocates only once and, therefore, does not need a partial conversion
function.)

> Step 2 is Normalization. It is the last part of Unicode that can be done localization-agnostically. No standard currently ships any interface for doing this: typically, the code written is non-portable for this because many OSes and frameworks make an assumption about what the user wants out (except perhaps on Windows, where MultibyteToWideChar will optionally compose/decompose based on a passed-in parameter flag). While this is fine for OSes and large developer codebases (e.g. Chrome, Firefox), it's not fine for people who will develop their tools on top of the standard: they need to be able to pick the normalization form right for their processing / users / operating system / etc.

It is not clear to me what the remark about Chrome and Firefox is
meant to communicate here. What do you mean? Are you suggesting that
the normalization should be part of encoding conversion as is the case
with MultiByteToWideChar?

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/

Received on 2019-05-06 20:08:07