Re: [SG16-Unicode] Draft SG16 direction paper

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 9 Oct 2018 23:57:30 -0400
On 10/09/2018 01:39 AM, Markus Scherer wrote:
> On Mon, Oct 8, 2018 at 7:45 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 10/08/2018 12:38 PM, Markus Scherer wrote:
>> > ICU supports customization of its internal code unit type, but
>> > char16_t is used by default, following ICU’s adoption of C++11.
>>
>> Not quite... ICU supports customization of its code unit type
>> _/for C APIs/_. Internally, and in C++ APIs, we switched to
>> char16_t. And because that broke call sites, we mitigated where
>> we could with overloads and shim classes.
>
> Ah, thank you for the correction. If we end up submitting a
> revision of the paper, I'll include this correction. I had
> checked the ICU sources (include/unicode/umachine.h) and verified
> that the UChar typedef was configurable, but I didn't realize that
> configuration was limited to C code.
>
>
> We limited it to C API by doing s/UChar/char16_t/g in C++ API, except
> where we replaced a raw pointer with a shim class. So you won't see
> "UChar" in C++ API any more at all.
>
> Internally to compiling ICU itself, we kept UChar in existing code (so
> that we didn't have to change tens of thousands of lines) but fixed it
> to be a typedef for char16_t.
>
> Unfortunately, if UChar is configured != char16_t, you need casts or
> cast helpers for using C APIs from C++ code.

I see, thanks for the detail. The C standard defines a (very) few
functions in terms of the C char16_t typedef (mbrtoc16, c16rtomb).
Within C++, those functions are exposed in the std namespace as though
they were declared with the C++ builtin char16_t type. Has there been
much consideration for similarly exposing ICU's C APIs to C++
consumers? (This technique is not without complexities. For example,
attempting to take the address of an overloaded function without a cast
may be ambiguous. I'm just curious how much this or similar techniques
were explored and what the conclusions were.)
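
To illustrate the kind of ambiguity I have in mind, here is a minimal
sketch (the UChar configuration and the function names are invented for
the example; they are not actual ICU API):

#include <cstdint>

using UChar = std::uint16_t;          // a C API configured with UChar != char16_t

void uchar_func(const UChar*) {}      // C-style entry point
void uchar_func(const char16_t*) {}   // char16_t overload exposed to C++ consumers

template<typename F>
void apply(F f) { f(u"hi"); }         // generic code that takes a callable

int main() {
   // apply(uchar_func);              // error: can't deduce F from an overload set
   apply(static_cast<void(*)(const char16_t*)>(uchar_func));  // OK with a cast
}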

>
> It would be interesting to get more perspective on how and why ICU
> evolved like it did. What was the motivation for ICU to switch to
> char16_t? Were the anticipated benefits realized despite the
> perhaps unanticipated complexities?
>
>
> We assumed that C++ code was going to adopt char16_t and maybe
> std::u16string, and we wanted it to be easy for ICU to work with those
> types.

Perhaps that will still happen :)

>
> In particular, the string literals weighed heavily. For the most part,
> we can ignore the standard library when it comes to Unicode, but the
> previous lack of real UTF-16 string literals was extremely
> inconvenient. We used to have all kinds of static const UChar arrays
> with numeric intializer lists, or init-once code for setting up string
> "constants", even when they contained only ASCII characters.

I remember doing similarly back in the day :)

I also remember looking forward to C99 compound literals so as to avoid
the statics:

typedef unsigned short UChar; /* 16-bit code unit */
typedef const UChar UTF16_LITERAL[];
void use(const UChar*);
void f() {
   use((UTF16_LITERAL){ 0x48 /*H*/, 0x69 /*i*/, 0 });
}

I prefer real literals :)
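
With C++11 literals, the equivalent is just (a trivial sketch):

void use(const char16_t*);
void f() {
   use(u"Hi");  // a real UTF-16 string literal; no statics, no init-once code
}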

>
> Now that we can use u"literals" we managed to clean up some of our
> code, and new library code and especially new unit test code benefits
> greatly.
>
> If Windows were to suddenly sprout Win32 interfaces defined in
> terms of char16_t, would the pain be substantially relieved?
>
>
> No. Many if not most of our users are not on Windows, or at least not
> only on Windows. UTF-16 is fairly widely used.
>
> Anyway, I doubt that Windows will do that. Operating systems want to
> never break code like this, and these would all be duplicates.
> Although I suppose they could do it as a header-only shim.

I've never heard of any plans to add such interfaces. I was just
curious whether, if they were added, it would be helpful. I suspect it
would be helpful for Windows users, but perhaps not exceptionally so.

>
> Microsoft was pretty unhappy with this change in ICU. They went with
> it because they were early in their integration of ICU into Windows.
>
> They also have some fewer problems: I believe they concluded that the
> aliasing trick was so developer-hostile that they decided never to
> optimize based on it, at least for the types involved. I don't think
> our aliasing barrier is defined on Windows.

I can understand that. It might make sense for us to consider allowing
reinterpret_cast<char8_t*>(char_pointer_expression) to not be undefined
behavior, at least as a deprecated feature. We could actually specify
this since the underlying type of char8_t would be the same everywhere
(unlike char16_t).
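
A minimal sketch of the pattern that would benefit (process_utf8 is a
made-up function name, not from any particular library):

#include <cstddef>

void process_utf8(const char8_t* s, std::size_t n);

void call_site(const char* bytes, std::size_t n) {
   // The cast itself is well-formed, but reading the underlying char
   // objects through the resulting char8_t pointer is undefined behavior
   // under the current aliasing rules; relaxing that is what I mean above.
   process_utf8(reinterpret_cast<const char8_t*>(bytes), n);
}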

>
> If u"literals" had just been uint16_t* without a new type, then we
> could have used string literals without changing API and breaking call
> sites, on most platforms anyway. And if uint16_t==wchar_t on Windows,
> then that would have been fine, too.

How would that have been fine on Windows? The reinterpret casts would
still have been required. I suspect the alias barrier would still be
needed for non-Microsoft compilers on Windows.

>
> Note: Of course there are places where we use uint16_t* binary data,
> but there is never any confusion whether a function works with binary
> data vs. a string. You just wouldn't use the same function or name for
> unrelated operations.
>
> Note also: While most of ICU works with UTF-16, we do have some UTF-8
> functions. We distinguish the two with different function names, such
> as in class CaseMap
> <http://icu-project.org/apiref/icu4c/classicu_1_1CaseMap.html>
> (toLower() vs. utf8ToLower()).
>
> If we had operations that worked on both UTF-8 and some other charset,
> we would also use different names.

This may be where we have differing perspectives. The trend in modern
C++ is towards generic code, and overloading plays an important role there.

>
> Are code bases that use ICU on non-Windows platforms (slowly)
> migrating from uint16_t to char16_t?
>
>
> I don't remember what Chromium and Android ended up doing. You could
> take a look at their code.
>
>> If you do want a distinct type, why not just standardize on
>> uint8_t? Why does it need to be a new type that is distinct from
>> that, too?
> Lyberta provided one example; we do need to be able to overload or
> specialize on character vs integer types.
>
>
> I don't find the examples so far convincing. Overloading on primitive
> types to distinguish between UTF-8 vs. one or more legacy charsets
> seems both unnecessary and like bad practice. Explicit naming of
> things that are different is good.

Lyberta provided one example, but there are others. For example,
serialization and logging libraries. Consider a modern JSON library; it
is convenient to be able to write code like the following that just works.

json_object player;
uint16_t scores[] = { 16, 27, 13 };
player["id"] = 42;
player["name"] = std::u16string(u"Skipper McGoof");
player["nickname"] = u"Goofy"; // stores a string
player["scores"] = scores; // stores an array of numbers.

Note that the above works because uint16_t is effectively never defined
in terms of a character type. That isn't true for uint8_t.
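
For example, a sketch of why the distinction collapses for uint8_t (the
first assertion holds on common implementations, though the standard
does not guarantee it):

#include <cstdint>
#include <type_traits>

// std::uint8_t is in practice a typedef for unsigned char -- a character
// type -- so overloads cannot tell "small numbers" from "narrow text".
static_assert(std::is_same_v<std::uint8_t, unsigned char>,
              "true on common implementations, not guaranteed");

// std::uint16_t is never a character type, and char16_t is always a
// distinct builtin type, so the scores assignment above reliably selects
// the numeric-array overload.
static_assert(!std::is_same_v<std::uint16_t, char16_t>);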

Other examples come up in language binding libraries like sol2 where it
is desirable to map native types across language boundaries.

Having different types for character data makes the above possible
without having to hard-code for specific string types. In the
concepts-enabled world that we are moving into, this enables us to write
concepts like the following that can then be used to constrain functions
intended to work only on string-like types.

template<typename T>
concept Character = AnySameUnqualified<T, char, wchar_t, char8_t,
char16_t, char32_t>;
template<typename T>
concept String = Range<T> && Character<ValueType<T>>;
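
AnySameUnqualified, Range, and ValueType above are placeholders rather
than standard names. One approximate spelling in terms of standard
concepts and ranges facilities (my sketch, not the text_view
definitions) would be:

#include <concepts>
#include <cstdint>
#include <ranges>
#include <string>
#include <type_traits>
#include <vector>

template<typename T, typename... Us>
concept AnySameUnqualified =
    (std::same_as<std::remove_cv_t<T>, Us> || ...);

template<typename T>
concept Character = AnySameUnqualified<T, char, wchar_t, char8_t,
                                        char16_t, char32_t>;

template<typename T>
concept String = std::ranges::range<T> &&
                 Character<std::ranges::range_value_t<T>>;

static_assert(String<std::u16string>);
static_assert(!String<std::vector<std::uint16_t>>);  // uint16_t is not a character type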

For the imaginary JSON example above, we might then write:

template<String S>
json_value& json_value::operator=(const S& s) {
   to_utf8_string(s);
   return *this;
}
template<Character C>
json_value& json_value::operator=(const C* s) {
   to_utf8_string(s);
   return *this;
}
template<Number T, std::size_t N>
json_value& json_value::operator=(const T (&a)[N]) {
   to_array(a);
   return *this;
}

>
> What makes sense to me is that "char" can be signed, and that's bad
> for dealing with non-ASCII characters.

Yes, yes it is :)

> In ICU, when I get to actual UTF-8 processing, I tend to either cast
> each byte to uint8_t or cast the whole pointer to uint8_t* and call an
> internal worker function.
> Somewhat ironically, the fastest way to test for a UTF-8 trail byte is
> via the opposite cast, testing if (int8_t)b<-0x40.

Assuming a two's complement representation, which we're nearly set to be
able to assume in C++20 (http://wg21.link/p0907)!
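
To make the trick concrete, a sketch of the test Markus describes (not
ICU's actual macro):

#include <cstdint>

// A UTF-8 trail byte has the form 10xxxxxx (0x80..0xBF).  Read as a signed
// two's-complement byte, that range is -0x80..-0x41, so a single signed
// comparison replaces the usual mask-and-compare.
inline bool is_trail_byte(std::uint8_t b) {
   return static_cast<std::int8_t>(b) < -0x40;
   // equivalent, with one more operation: (b & 0xC0) == 0x80
}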

>
> This is why I said it would be much simpler if the "char" default
> could be changed to be unsigned.
> I realize that non-portable code that assumes a signed char type would
> then need the opposite command-line option that people now use to
> force it to unsigned.

I haven't thought about this enough yet to have a sense of how big a
change this would be.

>
> Since uint8_t is conditionally supported, we can't rely on its
> existence within the standard (we'd have to use unsigned char or
> uint_least8_t instead).
>
>
> I seriously doubt that there is a platform that keeps up with modern
> C++ and does not have a real uint8_t.

That may be. Removing the conditionally supported qualification might
be a possibility these days. I'm really not sure.

>
> ICU is one of the more widely portable libraries (or was, until we
> adopted C++11 and left some behind) and would likely fail royally if
> the uint8_t and uint16_t types we are using were actually wider than
> advertised and revealed larger values etc. Since ICU is also widely
> used, that would break a lot of systems. But no one has ever reported
> a bug (or request for porting patches) related to non-power-of-2
> integer types.
>
> I think there is value in maintaining consistency with char16_t
> and char32_t. char8_t provides the missing piece needed to enable
> a clean, type safe, external vs internal encoding model that
> allows use of any of UTF-8, UTF-16, or UTF-32 as the internal
> encoding, that is easy to teach, and that facilitates generic
> libraries like text_view that work seamlessly with any of these
> encodings.
>
>
> Maybe. I don't see the need to use the same function names for a
> variety of legacy charsets vs. UTF-8.

I do. Again, primarily for writing generic code. I expect the need to
do so to increase in modern C++.

>
> 20 years ago I wrote a set of macros that looked the same but had
> versions for UTF-8, UTF-16, and UTF-32. I briefly thought we could
> make (some of?) ICU essentially switchable between UTFs. I quickly
> learned that any real, non-trivial code you would want to write for
> either of them wants to be specific to that UTF, especially when
> people want text processing to be fast. (You can see remnants of this
> youthful folly in ICU's unicode/utf_old.h header file.)

I agree that when you get down to actually manipulating the text, you
effectively need (chunks of) contiguous storage and encoding-specific
support, and at that point the desire to overload or specialize drops
significantly. The advantages of being able to deduce an encoding or to
overload/specialize appear at higher levels of abstraction: in code
that only needs to recognize and direct text to the right low-level
function.

Tom.

>
> Best regards,
> markus



Received on 2018-10-10 05:57:33