On Mon, Oct 8, 2018 at 7:45 PM Tom Honermann <tom@honermann.net> wrote:

On 10/08/2018 12:38 PM, Markus Scherer wrote:

> ICU supports customization of its internal code unit type, but char16_t is used by default, following ICU’s adoption of C++11.

Not quite... ICU supports customization of its code unit type for C APIs. Internally, and in C++ APIs, we switched to char16_t. And because that broke call sites, we mitigated where we could with overloads and shim classes.

Ah, thank you for the correction. If we end up submitting a revision of the paper, I'll include this correction. I had checked the ICU sources (include/unicode/umachine.h) and verified that the UChar typedef was configurable, but I didn't realize that configuration was limited to C code.

We limited it to C API by doing s/UChar/char16_t/g in C++ API, except where we replaced a raw pointer with a shim class. So you won't see "UChar" in C++ API any more at all.

Internally to compiling ICU itself, we kept UChar in existing code (so that we didn't have to change tens of thousands of lines) but fixed it to be a typedef for char16_t.

Unfortunately, if UChar is configured != char16_t, you need casts or cast helpers for using C APIs from C++ code.
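
To illustrate what such a cast helper might look like, here is a minimal sketch. All names in it (MyUChar, my_u_strlen, toMyUCharPtr) are hypothetical stand-ins, not ICU API, and the barrier is GCC/Clang-specific. Reading char16_t storage through a uint16_t pointer is formally an aliasing violation, so this shows the kind of trick under discussion, not a recommendation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical: a C API built with its code unit type configured as uint16_t.
typedef std::uint16_t MyUChar;

static_assert(sizeof(MyUChar) == sizeof(char16_t),
              "this sketch assumes both types are 16 bits wide");

// Hypothetical C-style function: counts code units before the NUL terminator.
inline std::size_t my_u_strlen(const MyUChar* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// Hypothetical cast helper for C++ callers that hold char16_t strings.
inline const MyUChar* toMyUCharPtr(const char16_t* p) {
#if defined(__GNUC__) || defined(__clang__)
    // Compiler-level barrier (an assumption of this sketch): tells the
    // optimizer that memory may have been touched, to discourage
    // type-based alias reasoning across the cast.
    __asm__ __volatile__("" : : "rm"(p) : "memory");
#endif
    return reinterpret_cast<const MyUChar*>(p);
}
```

A C++ caller would then write something like my_u_strlen(toMyUCharPtr(u"ICU")) at each call site, which is the friction described above.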

It would be interesting to get more perspective on how and why ICU evolved like it did. What was the motivation for ICU to switch to char16_t? Were the anticipated benefits realized despite the perhaps unanticipated complexities?

We assumed that C++ code was going to adopt char16_t and maybe std::u16string, and we wanted it to be easy for ICU to work with those types.

In particular, the string literals weighed heavily. For the most part, we can ignore the standard library when it comes to Unicode, but the previous lack of real UTF-16 string literals was extremely inconvenient. We used to have all kinds of static const UChar arrays with numeric initializer lists, or init-once code for setting up string "constants", even when they contained only ASCII characters.

Now that we can use u"literals" we managed to clean up some of our code, and new library code and especially new unit test code benefits greatly.
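
For readers who have not seen the old style, a small sketch of the difference. The constant names and the "root" text are made up for illustration and are not taken from ICU sources.

```cpp
#include <string>

// Old style: a numeric initializer list, even for pure ASCII text.
static const char16_t kRootOld[] = { 0x72, 0x6f, 0x6f, 0x74, 0 };  // "root"

// C++11 style: a real UTF-16 string literal.
static const char16_t kRootNew[] = u"root";

// Both arrays hold the same NUL-terminated sequence of code units.
inline bool sameText() {
    return std::u16string(kRootOld) == std::u16string(kRootNew);
}
```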

If Windows were to suddenly sprout Win32 interfaces defined in terms of char16_t, would the pain be substantially relieved?

No. Many if not most of our users are not on Windows, or at least not only on Windows. UTF-16 is fairly widely used.

Anyway, I doubt that Windows will do that. Operating systems want to never break code like this, and these would all be duplicates.

Although I suppose they could do it as a header-only shim.

Microsoft was pretty unhappy with this change in ICU. They went with it because they were early in their integration of ICU into Windows.

They also have somewhat fewer problems: I believe they concluded that the aliasing trick was so developer-hostile that they decided never to optimize based on it, at least for the types involved. I don't think our aliasing barrier is defined on Windows.

If u"literals" had just been uint16_t* without a new type, then we could have used string literals without changing API and breaking call sites, on most platforms anyway. And if uint16_t==wchar_t on Windows, then that would have been fine, too.

Note: Of course there are places where we use uint16_t* binary data, but there is never any confusion whether a function works with binary data vs. a string. You just wouldn't use the same function or name for unrelated operations.

Note also: While most of ICU works with UTF-16, we do have some UTF-8 functions. We distinguish the two with different function names, such as in class CaseMap (toLower() vs. utf8ToLower()).

If we had operations that worked on both UTF-8 and some other charset, we would also use different names.

Are code bases that use ICU on non-Windows platforms (slowly) migrating from uint16_t to char16_t?

I don't remember what Chromium and Android ended up doing. You could take a look at their code.

If you do want a distinct type, why not just standardize on uint8_t? Why does it need to be a new type that is distinct from that, too?

Lyberta provided one example; we do need to be able to overload or specialize on character vs integer types.

I don't find the examples so far convincing. Overloading on primitive types to distinguish between UTF-8 vs. one or more legacy charsets seems both unnecessary and like bad practice. Explicit naming of things that are different is good.

What makes sense to me is that "char" can be signed, and that's bad for dealing with non-ASCII characters.

In ICU, when I get to actual UTF-8 processing, I tend to either cast each byte to uint8_t or cast the whole pointer to uint8_t* and call an internal worker function.
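
A minimal sketch of that pattern, with hypothetical function names (not ICU code), and assuming well-formed UTF-8 input. The worker sees bytes as 0..255, so comparisons like >= 0xC0 behave the same whether plain char is signed or unsigned.

```cpp
#include <cstddef>
#include <cstdint>

namespace {

// Hypothetical internal worker: operates on unsigned bytes only.
std::size_t countCodePointsImpl(const std::uint8_t* s, std::size_t length) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        // Count every byte that is not a trail byte (0x80..0xBF).
        // With a signed char, "s[i] >= 0xC0" could never be true.
        if (s[i] < 0x80 || s[i] >= 0xC0) {
            ++count;
        }
    }
    return count;
}

}  // namespace

// Hypothetical public entry point: takes plain char and casts the
// whole pointer once before calling the worker.
inline std::size_t countCodePoints(const char* s, std::size_t length) {
    return countCodePointsImpl(reinterpret_cast<const std::uint8_t*>(s), length);
}
```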

Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via the opposite cast, testing if (int8_t)b<-0x40.
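
To spell that out: trail bytes are 0x80..0xBF, which read as -128..-65 when viewed as int8_t, so a single signed comparison against -0x40 (-64) covers the range. The sketch below uses hypothetical function names and checks the signed comparison against the usual mask test. It assumes two's complement conversion to int8_t, which is guaranteed only from C++20 on and implementation-defined before that.

```cpp
#include <cstdint>

// Signed-compare form: one comparison, no masking.
inline bool isTrailBySignedCompare(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Conventional form: the top two bits must be 10.
inline bool isTrailByMask(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}

// Returns true if both tests agree for every possible byte value.
inline bool testsAgreeForAllBytes() {
    for (int v = 0; v <= 0xFF; ++v) {
        const std::uint8_t b = static_cast<std::uint8_t>(v);
        if (isTrailBySignedCompare(b) != isTrailByMask(b)) {
            return false;
        }
    }
    return true;
}
```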

This is why I said it would be much simpler if the "char" default could be changed to be unsigned.

I realize that non-portable code that assumes a signed char type would then need the opposite command-line option that people now use to force it to unsigned.

Since uint8_t is conditionally supported, we can't rely on its existence within the standard (we'd have to use unsigned char or uint_least8_t instead).

I seriously doubt that there is a platform that keeps up with modern C++ and does not have a real uint8_t.

ICU is one of the more widely portable libraries (or was, until we adopted C++11 and left some behind) and would likely fail royally if the uint8_t and uint16_t types we are using were actually wider than advertised and revealed larger values etc. Since ICU is also widely used, that would break a lot of systems. But no one has ever reported a bug (or request for porting patches) related to non-power-of-2 integer types.
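
As an aside, a library that relies on this could state the assumption at compile time. The following is only a sketch of that idea, not something taken from ICU. The function name is hypothetical.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// Fail the build on a platform where bytes are not 8 bits wide or where
// the "least" types are wider than their names suggest.
static_assert(CHAR_BIT == 8, "bytes are assumed to be exactly 8 bits");
static_assert(std::numeric_limits<std::uint_least8_t>::digits == 8,
              "uint_least8_t is assumed to be exactly 8 bits");
static_assert(std::numeric_limits<std::uint_least16_t>::digits == 16,
              "uint_least16_t is assumed to be exactly 16 bits");

// Hypothetical helper so that callers can also query the result at run time.
constexpr bool fixedWidthAssumptionsHold() {
    return CHAR_BIT == 8 &&
           std::numeric_limits<std::uint_least8_t>::digits == 8 &&
           std::numeric_limits<std::uint_least16_t>::digits == 16;
}
```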

I think there is value in maintaining consistency with char16_t and char32_t. char8_t provides the missing piece needed to enable a clean, type safe, external vs internal encoding model that allows use of any of UTF-8, UTF-16, or UTF-32 as the internal encoding, that is easy to teach, and that facilitates generic libraries like text_view that work seamlessly with any of these encodings.

Maybe. I don't see the need to use the same function names for a variety of legacy charsets vs. UTF-8.

20 years ago I wrote a set of macros that looked the same but had versions for UTF-8, UTF-16, and UTF-32. I briefly thought we could make (some of?) ICU essentially switchable between UTFs. I quickly learned that any real, non-trivial code you would want to write for any of them wants to be specific to that UTF, especially when people want text processing to be fast. (You can see remnants of this youthful folly in ICU's unicode/utf_old.h header file.)

Best regards,

markus