On 10/09/2018 01:39 AM, Markus Scherer wrote:
On Mon, Oct 8, 2018 at 7:45 PM Tom Honermann <tom@honermann.net> wrote:
On 10/08/2018 12:38 PM, Markus Scherer wrote:
ICU supports customization of its internal code unit type, but char16_t is used by default, following ICU’s adoption of C++11.

Not quite... ICU supports customization of its code unit type for C APIs. Internally, and in C++ APIs, we switched to char16_t. And because that broke call sites, we mitigated where we could with overloads and shim classes.

Ah, thank you for the correction.  If we end up submitting a revision of the paper, I'll include this correction.  I had checked the ICU sources (include/unicode/umachine.h) and verified that the UChar typedef was configurable, but I didn't realize that configuration was limited to C code.

We limited it to C API by doing s/UChar/char16_t/g in C++ API, except where we replaced a raw pointer with a shim class. So you won't see "UChar" in C++ API any more at all.

Internally to compiling ICU itself, we kept UChar in existing code (so that we didn't have to change tens of thousands of lines) but fixed it to be a typedef for char16_t.

Unfortunately, if UChar is configured != char16_t, you need casts or cast helpers for using C APIs from C++ code.

I see, thanks for the detail.  The C standard defines a (very) few functions in terms of the C char16_t typedef (mbrtoc16, c16rtomb).  Within C++, those functions are exposed in the std namespace as though they were declared with the C++ builtin char16_t type.  Has there been much consideration for similarly exposing ICU's C APIs to C++ consumers?  (This technique is not without complexities.  For example, attempting to take the address of an overloaded function without a cast may be ambiguous.  I'm just curious how much this or similar techniques were explored and what the conclusions were.)
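
The address-of complexity mentioned above can be sketched roughly as follows. This is only an illustration, not ICU or standard library code: length_of and Char16Fn are hypothetical names, and the unsigned short overload merely stands in for a C API declared with a configurable 16-bit typedef.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical overload set: one function for the C++ builtin char16_t,
// one for a 16-bit integer type such as a C API typedef might name.
std::size_t length_of(const char16_t* s) {
  std::size_t n = 0;
  while (s[n] != 0) ++n;
  return n;
}
std::size_t length_of(const unsigned short* s) {
  std::size_t n = 0;
  while (s[n] != 0) ++n;
  return n;
}

// auto fn = &length_of;  // ill-formed: nothing says which overload is meant

// Naming the target function pointer type selects one overload.
using Char16Fn = std::size_t (*)(const char16_t*);

Char16Fn pick_char16_overload() {
  return static_cast<Char16Fn>(&length_of);
}
```

Callers that only ever call length_of(u"Hi") never notice the overload set; the cast is needed only when the function name is used as a value.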


It would be interesting to get more perspective on how and why ICU evolved like it did.  What was the motivation for ICU to switch to char16_t?  Were the anticipated benefits realized despite the perhaps unanticipated complexities?

We assumed that C++ code was going to adopt char16_t and maybe std::u16string, and we wanted it to be easy for ICU to work with those types.

Perhaps that will still happen :)


In particular, the string literals weighed heavily. For the most part, we can ignore the standard library when it comes to Unicode, but the previous lack of real UTF-16 string literals was extremely inconvenient. We used to have all kinds of static const UChar arrays with numeric initializer lists, or init-once code for setting up string "constants", even when they contained only ASCII characters.

I remember doing similarly back in the day :)

I also remember looking forward to C99 compound literals so as to avoid the statics:

typedef unsigned short UChar;  /* 16-bit code unit */
typedef const UChar UTF16_LITERAL[];
void use(const UChar*);
void f() {
  use((UTF16_LITERAL){ 0x48 /*H*/, 0x69 /*i*/, 0 });
}

I prefer real literals :)


Now that we can use u"literals" we managed to clean up some of our code, and new library code and especially new unit test code benefits greatly.
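
For comparison, here is a small sketch of the two styles side by side. The names kHelloOld, kHelloNew, and same_text are made up for illustration, and both arrays use char16_t only so that they can be compared directly; older code would have used a UChar typedef for the first one.

```cpp
#include <cassert>
#include <string>

// Old style: a static array with a numeric initializer list,
// even though the text is plain ASCII ("Hello").
static const char16_t kHelloOld[] = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0 };

// C++11 style: a real UTF-16 string literal.
static const char16_t kHelloNew[] = u"Hello";

// Both arrays hold the same NUL-terminated UTF-16 text.
bool same_text() {
  return std::u16string(kHelloOld) == std::u16string(kHelloNew);
}
```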

If Windows were to suddenly sprout Win32 interfaces defined in terms of char16_t, would the pain be substantially relieved?

No. Many if not most of our users are not on Windows, or at least not only on Windows. UTF-16 is fairly widely used.

Anyway, I doubt that Windows will do that. Operating systems want to never break code like this, and these would all be duplicates.
Although I suppose they could do it as a header-only shim.

I've never heard of any plans to add such interfaces.  I was just curious whether they would be helpful if they were added.  I suspect they would be helpful for Windows users, but perhaps not exceptionally so.


Microsoft was pretty unhappy with this change in ICU. They went with it because they were early in their integration of ICU into Windows.

They also have somewhat fewer problems: I believe they concluded that the aliasing trick was so developer-hostile that they decided never to optimize based on it, at least for the types involved. I don't think our aliasing barrier is defined on Windows.

I can understand that.  It might make sense for us to consider allowing reinterpret_cast<char8_t*>(char_pointer_expression) to not be undefined behavior, at least as a deprecated feature.  We could actually specify this since the underlying type of char8_t would be the same everywhere (unlike char16_t).


If u"literals" had just been uint16_t* without a new type, then we could have used string literals without changing API and breaking call sites, on most platforms anyway. And if uint16_t==wchar_t on Windows, then that would have been fine, too.

How would that have been fine on Windows?  The reinterpret casts would still have been required.  I suspect the alias barrier would still be needed for non-Microsoft compilers on Windows.


Note: Of course there are places where we use uint16_t* binary data, but there is never any confusion whether a function works with binary data vs. a string. You just wouldn't use the same function or name for unrelated operations.

Note also: While most of ICU works with UTF-16, we do have some UTF-8 functions. We distinguish the two with different function names, such as in class CaseMap (toLower() vs. utf8ToLower()).

If we had operations that worked on both UTF-8 and some other charset, we would also use different names.

This may be where we have differing perspectives.  The trend in modern C++ is towards generic code and overloading plays an important role there.


Are code bases that use ICU on non-Windows platforms (slowly) migrating from uint16_t to char16_t?

I don't remember what Chromium and Android ended up doing. You could take a look at their code.

If you do want a distinct type, why not just standardize on uint8_t? Why does it need to be a new type that is distinct from that, too?
Lyberta provided one example; we do need to be able to overload or specialize on character vs integer types.

I don't find the examples so far convincing. Overloading on primitive types to distinguish between UTF-8 vs. one or more legacy charsets seems both unnecessary and like bad practice. Explicit naming of things that are different is good.

Lyberta provided one example, but there are others.  For example, serialization and logging libraries.  Consider a modern JSON library; it is convenient to be able to write code like the following that just works.

json_object player;
uint16_t scores[] = { 16, 27, 13 };
player["id"] = 42;
player["name"] = std::u16string(u"Skipper McGoof");
player["nickname"] = u"Goofy"; // stores a string
player["scores"] = scores;     // stores an array of numbers.

Note that the above works because uint16_t is effectively never defined in terms of a character type.  That isn't true for uint8_t.
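
A minimal sketch of that difference, using hypothetical names (Kind, classify, kUint8IsUnsignedChar). The last constant is deliberately not asserted, since whether uint8_t is an alias for unsigned char depends on the platform, although it typically is.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

enum class Kind { Text, Number };

// char16_t is a distinct type, so these two overloads never collide.
inline Kind classify(char16_t) { return Kind::Text; }
inline Kind classify(std::uint16_t) { return Kind::Number; }

static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t is never an alias for uint16_t");

// Platform-dependent, but typically true: uint8_t is just unsigned char.
// Where it is true, overloads on unsigned char and uint8_t would be the
// same function declared twice, so the two cannot be told apart.
constexpr bool kUint8IsUnsignedChar =
    std::is_same<std::uint8_t, unsigned char>::value;
```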

Other examples come up in language binding libraries like sol2 where it is desirable to map native types across language boundaries.

Having different types for character data makes the above possible without having to hard-code for specific string types.  In the concepts enabled world that we are moving into, this enables us to write concepts like the following that can then be used to constrain functions intended to work only on string-like types.

template<typename T>
concept Character = AnySameUnqualified<T, char, wchar_t, char8_t, char16_t, char32_t>;
template<typename T>
concept String = Range<T> && Character<ValueType<T>>;

For the imaginary JSON example above, we might then write:

template<String S>
json_value& json_value::operator=(const S& s) {
  to_utf8_string(s);
  return *this;
}
template<Character C>
json_value& json_value::operator=(const C* s) {
  to_utf8_string(s);
  return *this;
}
template<Number T, std::size_t N>
json_value& json_value::operator=(const T (&a)[N]) {
  to_array(a);
  return *this;
}
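
The same dispatch can be approximated without concepts, which may be useful for trying the idea on a pre-C++20 compiler. The following is only a sketch under stated assumptions: is_character, is_string, element_t, json_value_sketch, and the store_* helpers are hypothetical names, char8_t is left out because it requires C++20, and the assignment operators just record which overload was chosen instead of converting anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <type_traits>
#include <utility>

// Character types, mirroring the Character concept (char8_t omitted).
template<typename T> struct is_character : std::false_type {};
template<> struct is_character<char> : std::true_type {};
template<> struct is_character<wchar_t> : std::true_type {};
template<> struct is_character<char16_t> : std::true_type {};
template<> struct is_character<char32_t> : std::true_type {};

// Element type of anything std::begin() accepts.
template<typename T>
using element_t = typename std::decay<
    decltype(*std::begin(std::declval<const T&>()))>::type;

// "String": a range whose elements are characters.
template<typename T, typename = void> struct is_string : std::false_type {};
template<typename T>
struct is_string<T,
    typename std::enable_if<is_character<element_t<T>>::value>::type>
    : std::true_type {};

enum class Stored { None, String, Array };

struct json_value_sketch {
  Stored kind = Stored::None;

  template<typename S,
           typename std::enable_if<is_string<S>::value, int>::type = 0>
  json_value_sketch& operator=(const S&) { kind = Stored::String; return *this; }

  template<typename C,
           typename std::enable_if<is_character<C>::value, int>::type = 0>
  json_value_sketch& operator=(const C*) { kind = Stored::String; return *this; }

  template<typename T, std::size_t N,
           typename std::enable_if<std::is_arithmetic<T>::value &&
                                   !is_character<T>::value, int>::type = 0>
  json_value_sketch& operator=(const T (&)[N]) { kind = Stored::Array; return *this; }
};

Stored store_u16string() {
  json_value_sketch v;
  v = std::u16string(u"Skipper McGoof");
  return v.kind;
}
Stored store_pointer() {
  const char16_t* nickname = u"Goofy";
  json_value_sketch v;
  v = nickname;
  return v.kind;
}
Stored store_scores() {
  std::uint16_t scores[] = { 16, 27, 13 };
  json_value_sketch v;
  v = scores;
  return v.kind;
}
```

The uint16_t array reaches the array overload only because uint16_t is not one of the character types; nothing in the sketch names std::u16string or any other specific string type.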


What makes sense to me is that "char" can be signed, and that's bad for dealing with non-ASCII characters.

Yes, yes it is :)

In ICU, when I get to actual UTF-8 processing, I tend to either cast each byte to uint8_t or cast the whole pointer to uint8_t* and call an internal worker function.
Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via the opposite cast, testing if (int8_t)b<-0x40.
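
A small sketch of that trail byte test next to the more conventional mask-and-compare form. The function names are hypothetical, not ICU's, and the signed version assumes that converting an out-of-range value to int8_t wraps as in two's complement, which is implementation-defined before C++20.

```cpp
#include <cassert>
#include <cstdint>

// Conventional form: UTF-8 trail bytes are 0x80..0xBF (bit pattern 10xxxxxx).
inline bool is_trail_masked(std::uint8_t b) {
  return (b & 0xC0) == 0x80;
}

// Signed form from the text: reinterpreted as int8_t, 0x80..0xBF become
// -128..-65, which are exactly the values below -0x40 (-64).
inline bool is_trail_signed(std::uint8_t b) {
  return static_cast<std::int8_t>(b) < -0x40;
}

// Checks that both forms agree for every possible byte value.
inline bool trail_tests_agree() {
  for (int i = 0; i < 256; ++i) {
    const std::uint8_t b = static_cast<std::uint8_t>(i);
    if (is_trail_masked(b) != is_trail_signed(b)) return false;
  }
  return true;
}
```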

Assuming a two's complement representation, which we're nearly set to be able to assume in C++20 (http://wg21.link/p0907)!


This is why I said it would be much simpler if the "char" default could be changed to be unsigned.
I realize that non-portable code that assumes a signed char type would then need the opposite command-line option that people now use to force it to unsigned.

I haven't thought about this enough yet to have a sense of how big a change this would be.


Since uint8_t is conditionally supported, we can't rely on its existence within the standard (we'd have to use unsigned char or uint_least8_t instead).

I seriously doubt that there is a platform that keeps up with modern C++ and does not have a real uint8_t.

That may be.  Removing the conditionally supported qualification might be a possibility these days.  I'm really not sure.


ICU is one of the more widely portable libraries (or was, until we adopted C++11 and left some behind) and would likely fail royally if the uint8_t and uint16_t types we are using were actually wider than advertised and revealed larger values etc. Since ICU is also widely used, that would break a lot of systems. But no one has ever reported a bug (or request for porting patches) related to non-power-of-2 integer types.

I think there is value in maintaining consistency with char16_t and char32_t.  char8_t provides the missing piece needed to enable a clean, type safe, external vs internal encoding model that allows use of any of UTF-8, UTF-16, or UTF-32 as the internal encoding, that is easy to teach, and that facilitates generic libraries like text_view that work seamlessly with any of these encodings.

Maybe. I don't see the need to use the same function names for a variety of legacy charsets vs. UTF-8.

I do.  Again, primarily for writing generic code.  I expect the need to do so to increase in modern C++.


20 years ago I wrote a set of macros that looked the same but had versions for UTF-8, UTF-16, and UTF-32. I briefly thought we could make (some of?) ICU essentially switchable between UTFs. I quickly learned that any real, non-trivial code you would want to write for either of them wants to be specific to that UTF, especially when people want text processing to be fast. (You can see remnants of this youthful folly in ICU's unicode/utf_old.h header file.)

I agree that when you get down to actually manipulating the text, you effectively need (chunks of) contiguous storage and encoding specific support and at that point, the desire to overload or specialize drops significantly.  The advantages in being able to deduce an encoding or overload/specialize appear at higher levels of abstraction - in code that only needs to recognize and direct text to the right low level function.

Tom.


Best regards,
markus