Re: [SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Wed, 1 May 2019 03:41:14 -0400
[ On wide characters and portability ]
     Tom captured this perfectly: while the idea behind wide characters
and portability was okay, too many implementations bit the bullet too
quickly with regard to wchar_t and fixed it at 16 bits. This makes it
unsuitable for portable runtimes because, much as with char, the encoding
employed by wchar_t differs from system to system (Windows, IBM systems,
AIX, etc. all chose a 16-bit wchar_t). Like the narrow execution encoding,
the wide execution encoding is a dead end.

     iswblank/isblank and literally everything in <ctype> predicated on
this is thus broken. <ctype> isn't alone: C++ took this interface and
cast it in the everlasting ABI Steel of virtual functions in its codecvt
and facets (moneypunct, etc.). Most of these take only a single CharT in
the C++ interfaces -- a single 16-bit wchar_t -- or return only a single
16-bit value. The entire interface -- while workable for Linux systems,
which correctly settled on a 32-bit wchar_t -- is therefore a portability
dead end. It's incredibly unfortunate that the interface was not made
iterator-based, but hindsight is 20/20 and all that.
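
     To make the single-code-unit failure concrete, here is a quick
sketch (assuming a platform where wchar_t is 16-bit UTF-16; the Deseret
character is just a convenient non-BMP example):

    #include <cstdio>
    #include <cwctype>

    int main() {
        // U+10428 DESERET SMALL LETTER LONG I uppercases to U+10400,
        // but with 16-bit wchar_t it is stored as the surrogate pair
        // D801 DC28. towupper sees one code unit at a time, and a lone
        // surrogate is not a character, so neither call can perform
        // the mapping.
        std::wint_t lead = 0xD801, trail = 0xDC28;
        std::printf("towupper(lead)  = 0x%lX\n",
                    (unsigned long)std::towupper(lead));
        std::printf("towupper(trail) = 0x%lX\n",
                    (unsigned long)std::towupper(trail));
        // Both values come back unchanged: a per-code-unit interface
        // is structurally unable to case-map anything outside the BMP.
    }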

     We cannot duplicate iostream's interface for Unicode. That doesn't
help old applications, it introduces new teaching traps, and the
interface is still absolutely busted for UTF-8 and UTF-16.
std::u8cout/std::u16cout/std::u32cout are no-gos and should not be
pursued at all.

[ On the POSIX locale and the C locale ]
     The POSIX locale is fundamentally broken as a default: it mandates a
single-byte encoding in a multibyte world. The C locale is much more
permissive and therefore gives us much more wiggle room. This is why it
would make sense to work with the C and C++ Committees to jointly move
towards C.UTF8 as the default locale rather than just C (a lot of
implementations pick "C" to just mean POSIX).
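
     For concreteness, the status quo that change would replace (nothing
novel here; setlocale has been in C since C89):

    #include <clocale>
    #include <cstdio>

    int main() {
        // Before any setlocale call, a C or C++ program runs in the
        // "C" locale; the idea above is for this default to behave
        // like C.UTF8 instead.
        std::printf("startup locale: %s\n",
                    std::setlocale(LC_ALL, nullptr));
    }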

     Moving to C.UTF8 as a mandated default, heeding the lessons other
languages learned when making the same move (see:
https://www.python.org/dev/peps/pep-0538/), might help us turn things
around. While <iostream>s is broken, many of the actual converting C
functions are safe because of std::mbstate_t and the way they work (with
DR 488 and friends applied: many thanks to Philipp Krause). I still have
to check whether all the single-conversion functions also have C standard
"s" versions of the mbr/w/8/16/32 conversion functions for doing multi
code point processing, so that implementations can opt into their shiny
SIMD processing.
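
     As an illustration of why std::mbstate_t makes these functions
safe, a small sketch with mbrtoc32 (from C11's <uchar.h> / C++'s
<cuchar>); the UTF-8 locale names are platform-dependent guesses:

    #include <clocale>
    #include <cstdio>
    #include <cuchar>

    int main() {
        // The exact UTF-8 locale name varies by platform; try a couple.
        if (!std::setlocale(LC_ALL, "C.UTF-8"))
            std::setlocale(LC_ALL, "en_US.UTF-8");

        const char bytes[] = "\xC3\xA9"; // UTF-8 for U+00E9
        std::mbstate_t state{};
        char32_t c32;
        // mbrtoc32 is restartable: the mbstate_t carries partially
        // consumed input across calls, so conversion can resume
        // mid-character -- the property that makes these functions
        // safe for multibyte processing.
        std::size_t n = std::mbrtoc32(&c32, bytes, sizeof bytes - 1,
                                      &state);
        if (n != (std::size_t)-1 && n != (std::size_t)-2)
            std::printf("U+%04X from %zu bytes\n", (unsigned)c32, n);
    }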

[ Where to from Here? ]
     Step 0 is actually giving the C++ standard tools to handle
multibyte encodings. We can give multibyte encodings a break in
<iostream>s by striking the clause here:
http://eel.is/c++draft/locale.codecvt#virtuals-3. This allows multibyte
encodings into at least the codecvt part of iostreams with basic_filebuf
(which, it turns out, affects a lot of things, especially the things
based on file descriptors). There's nothing we can do to fix the rest of
<iostream>s: facets and moneypunct and friends are all irreversibly
broken, and given the emphasis on ABI these days, broken Mostly Forever™.
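
     For reference, this is the one conversion point that wording change
would open up -- basic_filebuf driving the imbued locale's codecvt facet
(a sketch; "input.txt" is a hypothetical file):

    #include <cwchar>
    #include <fstream>
    #include <iostream>
    #include <locale>

    int main() {
        // basic_filebuf converts between the external (on-disk,
        // possibly multibyte) encoding and the internal wchar_t
        // encoding through the locale's codecvt facet -- exactly the
        // conversion [locale.codecvt] p3 currently constrains.
        std::wifstream in("input.txt");
        in.imbue(std::locale(in.getloc(),
            new std::codecvt_byname<wchar_t, char, std::mbstate_t>("")));
        for (wchar_t c; in.get(c); )
            std::wcout.put(c);
    }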

     Step 1 is then starting on the encoding alternatives. Philipp
Krause has already written many papers on UTF functions for WG14; these
will ensure that, at minimum, we have escape hatches for proper Unicode
outside of wchar_t/char, and some degree of success in converting the
narrow/wide execution encodings to Unicode. WG21 is working on proposals
for UTF encodings as well, and for how to plug in one's own encodings
(that's what I and a few others are on the hook for).

     Step 2 is Normalization. It is the last part of Unicode that can be
done localization-agnostically. No standard currently ships any interface
for doing this: the code typically written for it is non-portable,
because many OSes and frameworks make an assumption about what the user
wants out (except perhaps on Windows, where MultiByteToWideChar will
optionally compose/decompose based on a passed-in parameter flag). While
this is fine for OSes and large developer codebases (e.g. Chrome,
Firefox), it's not fine for people who will develop their tools on top of
the standard: they need to be able to pick the normalization form right
for their processing / users / operating system / etc.
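
     A concrete instance of the problem, using nothing but byte
comparison:

    #include <cstdio>
    #include <cstring>

    int main() {
        // One user-perceived character, two normalization forms:
        //   NFC: U+00E9 precomposed              -> bytes C3 A9
        //   NFD: U+0065 + U+0301 combining acute -> bytes 65 CC 81
        const char nfc[] = "\xC3\xA9";
        const char nfd[] = "e\xCC\x81";
        // Byte-wise they differ even though they are canonically
        // equivalent; without a standard normalization interface,
        // portable code has no way to reconcile them.
        std::printf("bytewise equal? %s\n",
                    std::strcmp(nfc, nfd) == 0 ? "yes" : "no");
    }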

     From there, in no particular order: new localization /
internationalization interfaces based on Unicode Scalar Values (char32_t
or a strong typedef; that question is still up in the air). They could
also be encoding-based on char8_t/char16_t, but we must make sure they
accept ranges of values rather than single code units, to avoid the
mistakes of the past (see the toy sketch after this paragraph).
Bidirectional algorithms. Collation.
case_fold/to_lower/to_upper/to_titlecase, etc. Regular expressions.
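
     A toy sketch of the "ranges of values, not single code units" shape
(the name and scope are invented for illustration; the body is
deliberately ASCII-only):

    #include <string>
    #include <string_view>

    // Hypothetical shape: a whole sequence in, a whole sequence out.
    // This lets a real implementation produce more scalars than it
    // consumed (one-to-many case mappings such as German sharp-s to
    // "SS"), which no single-CharT virtual function can ever do.
    std::u32string to_upper(std::u32string_view in) {
        std::u32string out;
        out.reserve(in.size());
        for (char32_t c : in)
            out.push_back((c >= U'a' && c <= U'z') ? c - 0x20 : c);
        return out;
    }

    int main() {
        return to_upper(U"abc") == U"ABC" ? 0 : 1;
    }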

     After Steps 0-2, this list need not be tackled in order.

Sincerely,
JeanHeyd

P.S.: Nobody on this list has agreed to any of this. It's just what I
think the problems are and what we should be doing.

Received on 2019-05-01 09:41:31