sg16: Re: [SG16-Unicode] Draft SG16 direction paper

From: Markus Scherer <markus.icu_at_[hidden]>
Date: Mon, 8 Oct 2018 22:39:22 -0700

On Mon, Oct 8, 2018 at 7:45 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 10/08/2018 12:38 PM, Markus Scherer wrote:
>
> > ICU supports customization of its internal code unit type, but char16_t is
> > used by default, following ICU’s adoption of C++11.
>
> Not quite... ICU supports customization of its code unit type *for C APIs*.
> Internally, and in C++ APIs, we switched to char16_t. And because that
> broke call sites, we mitigated where we could with overloads and shim
> classes.
>
>
> Ah, thank you for the correction. If we end up submitting a revision of
> the paper, I'll include this correction. I had checked the ICU sources (
> include/unicode/umachine.h) and verified that the UChar typedef was
> configurable, but I didn't realize that configuration was limited to C code.
>

We limited it to C API by doing s/UChar/char16_t/g in C++ API, except where
we replaced a raw pointer with a shim class. So you won't see "UChar" in
C++ API any more at all.

Internally to compiling ICU itself, we kept UChar in existing code (so that
we didn't have to change tens of thousands of lines) but fixed it to be a
typedef for char16_t.

Unfortunately, if UChar is configured != char16_t, you need casts or cast
helpers for using C APIs from C++ code.

> It would be interesting to get more perspective on how and why ICU evolved
> like it did. What was the motivation for ICU to switch to char16_t?
> Were the anticipated benefits realized despite the perhaps unanticipated
> complexities?
>

We assumed that C++ code was going to adopt char16_t and maybe
std::u16string, and we wanted it to be easy for ICU to work with those
types.

In particular, the string literals weighed heavily. For the most part, we
can ignore the standard library when it comes to Unicode, but the previous
lack of real UTF-16 string literals was extremely inconvenient. We used to
have all kinds of static const UChar arrays with numeric initializer lists,
or init-once code for setting up string "constants", even when they
contained only ASCII characters.

Now that we can use u"literals" we managed to clean up some of our code,
and new library code and especially new unit test code benefits greatly.
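
To illustrate the difference described above, here is a minimal sketch. It is not ICU source code: the names kHelloOld, kHelloNew, and sameCodeUnits are hypothetical, and uint16_t stands in for a pre-C++11 UChar configuration.

```cpp
#include <cstddef>
#include <cstdint>

// Old style (hypothetical example): an ASCII-only UTF-16 "constant" spelled
// out as a numeric initializer list, with an explicit terminating 0.
static const std::uint16_t kHelloOld[] = { 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0 };

// C++11 style: a real UTF-16 string literal; the compiler appends the 0.
static const char16_t kHelloNew[] = u"hello";

// Returns true if both arrays hold the same code units, terminator included.
bool sameCodeUnits() {
    const std::size_t oldLen = sizeof(kHelloOld) / sizeof(kHelloOld[0]);
    const std::size_t newLen = sizeof(kHelloNew) / sizeof(kHelloNew[0]);
    if (oldLen != newLen) {
        return false;
    }
    for (std::size_t i = 0; i < newLen; ++i) {
        if (kHelloOld[i] != kHelloNew[i]) {
            return false;
        }
    }
    return true;
}
```

The two arrays carry identical code units; only the literal form is readable and checkable at a glance.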

> If Windows were to suddenly sprout Win32 interfaces defined in terms of
> char16_t, would the pain be substantially relieved?
>

No. Many if not most of our users are not on Windows, or at least not only
on Windows. UTF-16 is fairly widely used.

Anyway, I doubt that Windows will do that. Operating systems want to never
break code like this, and these would all be duplicates.
Although I suppose they could do it as a header-only shim.

Microsoft was pretty unhappy with this change in ICU. They went with it
because they were early in their integration of ICU into Windows.

They also have somewhat fewer problems: I believe they concluded that the
aliasing trick was so developer-hostile that they decided never to optimize
based on it, at least for the types involved. I don't think our aliasing
barrier is defined on Windows.

If u"literals" had just been uint16_t* without a new type, then we could
have used string literals without changing API and breaking call sites, on
most platforms anyway. And if uint16_t==wchar_t on Windows, then that would
have been fine, too.

Note: Of course there are places where we use uint16_t* binary data, but
there is never any confusion whether a function works with binary data vs.
a string. You just wouldn't use the same function or name for unrelated
operations.

Note also: While most of ICU works with UTF-16, we do have some UTF-8
functions. We distinguish the two with different function names, such
as in class
CaseMap <http://icu-project.org/apiref/icu4c/classicu_1_1CaseMap.html>
(toLower() vs. utf8ToLower()).

If we had operations that worked on both UTF-8 and some other charset, we
would also use different names.
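
The naming convention can be sketched as follows. This is an illustration only, not ICU API: countCodePointsUtf16 and countCodePointsUtf8 are hypothetical names, and both functions assume well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16 by
// skipping trail surrogates (0xDC00..0xDFFF).
std::size_t countCodePointsUtf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if ((c & 0xFC00) != 0xDC00) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by
// skipping trail bytes (0x80..0xBF). The encoding is stated in the
// function name rather than inferred from the parameter type.
std::size_t countCodePointsUtf8(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

A call site then reads countCodePointsUtf8(bytes) or countCodePointsUtf16(text), so the encoding being processed is visible without looking up the argument types.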

> Are code bases that use ICU on non-Windows platforms (slowly) migrating
> from uint16_t to char16_t?
>

I don't remember what Chromium and Android ended up doing. You could take a
look at their code.

> If you do want a distinct type, why not just standardize on uint8_t? Why
> does it need to be a new type that is distinct from that, too?
>
> Lyberta provided one example; we do need to be able to overload or
> specialize on character vs integer types.
>

I don't find the examples so far convincing. Overloading on primitive types
to distinguish between UTF-8 vs. one or more legacy charsets seems both
unnecessary and like bad practice. Explicit naming of things that are
different is good.

What makes sense to me is that "char" can be signed, and that's bad for
dealing with non-ASCII characters.
In ICU, when I get to actual UTF-8 processing, I tend to either cast each
byte to uint8_t or cast the whole pointer to uint8_t* and call an internal
worker function.
Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via
the opposite cast, testing if (int8_t)b<-0x40.
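
A minimal sketch of that trail byte test next to the more familiar mask-and-compare form. The function names are hypothetical, and the signed cast assumes a two's complement int8_t (guaranteed since C++20, and the case on mainstream platforms before that).

```cpp
#include <cstdint>

// Trail bytes are 0x80..0xBF. Reinterpreted as int8_t they become
// -128..-65, which is exactly the set of values below -0x40 (-64).
// Assumes the unsigned-to-signed conversion wraps (two's complement).
inline bool isTrailByteSigned(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Reference form: the top two bits of a trail byte are 10.
inline bool isTrailByteMasked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```

The signed form needs a single comparison, while the masked form needs a mask followed by a comparison.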

This is why I said it would be much simpler if the "char" default could be
changed to be unsigned.
I realize that non-portable code that assumes a signed char type would then
need the opposite command-line option that people now use to force it to
unsigned.

> Since uint8_t is conditionally supported, we can't rely on its existence
> within the standard (we'd have to use unsigned char or uint_least8_t
> instead).
>

I seriously doubt that there is a platform that keeps up with modern C++
and does not have a real uint8_t.

ICU is one of the more widely portable libraries (or was, until we adopted
C++11 and left some behind) and would likely fail royally if the uint8_t
and uint16_t types we are using were actually wider than advertised and
revealed larger values etc. Since ICU is also widely used, that would break
a lot of systems. But no one has ever reported a bug (or request for
porting patches) related to non-power-of-2 integer types.

> I think there is value in maintaining consistency with char16_t and char32_t.
> char8_t provides the missing piece needed to enable a clean, type safe,
> external vs internal encoding model that allows use of any of UTF-8,
> UTF-16, or UTF-32 as the internal encoding, that is easy to teach, and that
> facilitates generic libraries like text_view that work seamlessly with any
> of these encodings.
>

Maybe. I don't see the need to use the same function names for a variety of
legacy charsets vs. UTF-8.

20 years ago I wrote a set of macros that looked the same but had versions
for UTF-8, UTF-16, and UTF-32. I briefly thought we could make (some of?)
ICU essentially switchable between UTFs. I quickly learned that any real,
non-trivial code you would want to write for any of them wants to be
specific to that UTF, especially when people want text processing to be
fast. (You can see remnants of this youthful folly in ICU's
unicode/utf_old.h header file.)

Best regards,
markus

Received on 2018-10-09 07:39:38