sg16: Re: [SG16-Unicode] Draft SG16 direction paper

From: Markus Scherer <markus.icu_at_[hidden]>
Date: Tue, 16 Oct 2018 14:58:33 -0700

On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom_at_[hidden]> wrote:

> The C standard defines a (very) few functions in terms of the C char16_t
> typedef (mbrtoc16, c16rtomb). Within C++, those functions are exposed in
> the std namespace as though they were declared with the C++ builtin
> char16_t type. Has there been much consideration for similarly exposing
> ICU's C APIs to C++ consumers?
>

C++ code calls ICU C APIs all the time.
People use C APIs because they can be binary stable, and they want to be
able to link with multiple versions of the ICU DLL.

People who call C++ APIs either tightly control DLL versions or link
everything statically.

It would be really nice if it were feasible to provide a stable C++ API
from a shared library.

> (This technique is not without complexities. For example, attempting to
> take the address of an overloaded function without a cast may be
> ambiguous. I'm just curious how much this or similar techniques were
> explored and what the conclusions were)
>

Not sure what the question is.
There is of course no overloading on C APIs.
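
For context, the ambiguity described in the quoted paragraph would only
arise if C++ overloads were layered on top of the C functions. A minimal
sketch, assuming a hypothetical overload pair (count_units is invented for
illustration and is not an ICU or standard library function):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical overload pair: the same operation exposed for both unit types.
// Each counts code units up to the terminating zero.
constexpr std::size_t count_units(const std::uint16_t* s) {
    return *s == 0 ? 0 : 1 + count_units(s + 1);
}
constexpr std::size_t count_units(const char16_t* s) {
    return *s == 0 ? 0 : 1 + count_units(s + 1);
}

// auto fn = &count_units;   // error: cannot tell which overload is meant

// A declared target type selects one overload without a cast.
using Char16Counter = std::size_t (*)(const char16_t*);
constexpr Char16Counter by_target_type = &count_units;

// Where no target type is available (e.g. with auto), a cast is required.
constexpr auto by_cast = static_cast<Char16Counter>(&count_units);
```

With a plain C API there is a single function per name, so &function is
never ambiguous.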

>> If u"literals" had just been uint16_t* without a new type, then we could
>> have used string literals without changing API and breaking call sites, on
>> most platforms anyway. And if uint16_t==wchar_t on Windows, then that would
>> have been fine, too.
>
>
> How would that have been fine on Windows? The reinterpret casts would
> still have been required.
>

Why? If the two types had been typedefs of each other, there would not need
to be any casts.
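
A minimal sketch of the difference, using char16_t as it exists today
versus a plain typedef. The names Unit16 and api_accepts are hypothetical
and stand in for a library's own unit type and a C-style entry point; the
function deliberately does not read through the pointer.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical C-style entry point that takes uint16_t code units.
inline bool api_accepts(const std::uint16_t* s) { return s != nullptr; }

// A typedef is just another name for the same type: pointers convert freely.
typedef std::uint16_t Unit16;
static_assert(std::is_same<Unit16, std::uint16_t>::value,
              "typedef names the same type");

// char16_t is a distinct type, so its pointers do not convert implicitly.
static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t is a distinct type");

inline bool call_with_typedef(const Unit16* s) {
    return api_accepts(s);                 // no cast needed
}

inline bool call_with_literal() {
    const char16_t* lit = u"Goofy";
    // return api_accepts(lit);            // does not compile
    return api_accepts(reinterpret_cast<const std::uint16_t*>(lit));
}
```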

> Lyberta provided one example, but there are others. For example,
> serialization and logging libraries. Consider a modern JSON library; it is
> convenient to be able to write code like the following that just works.
>
> json_object player;
> uint16_t scores[] = { 16, 27, 13 };
> player["id"] = 42;
> player["name"] = std::u16string("Skipper McGoof");
> player["nickname"] = u"Goofy"; // stores a string
> player["scores"] = scores; // stores an array of numbers.
>
> Note that the above works because uint16_t is effectively never defined
> in terms of a character type.
>

Sure, but that feels like cherry-picking: You introduce one new type for
one specific kind of thing (a pointer to certain units holding a string),
but every other kind of data that's a vector of essentially the same base units is
still not distinguishable -- you wouldn't be able to distinguish scores
from coordinates from other lists of numbers etc.
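
Both points can be seen in a small sketch. The classify overloads below are
hypothetical (they are not part of any JSON library); they only show which
distinctions the type system can and cannot make here.

```cpp
#include <cstddef>
#include <cstdint>

enum class Kind { String, NumberArray };

// A pointer to char16_t units is treated as a string.
constexpr Kind classify(const char16_t*) { return Kind::String; }

// An array of uint16_t is treated as a list of numbers.
template <std::size_t N>
constexpr Kind classify(const std::uint16_t (&)[N]) { return Kind::NumberArray; }

constexpr std::uint16_t scores[]      = { 16, 27, 13 };
constexpr std::uint16_t coordinates[] = { 3, 4 };

// The distinct character type separates strings from numbers...
static_assert(classify(u"Goofy") == Kind::String, "literal is a string");
static_assert(classify(scores) == Kind::NumberArray, "scores are numbers");

// ...but two lists of the same base unit remain indistinguishable.
static_assert(classify(scores) == classify(coordinates), "same kind");
```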

> Having different types for character data makes the above possible without
> having to hard-code for specific string types. In the concepts enabled
> world that we are moving into, this enables us to write concepts like the
> following that can then be used to constrain functions intended to work
> only on string-like types.
>

I take your word for it. I know nothing about "concepts".

>> In ICU, when I get to actual UTF-8 processing, I tend to either cast each
>> byte to uint8_t or cast the whole pointer to uint8_t* and call an internal
>> worker function.
>> Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via
>> the opposite cast, testing if (int8_t)b<-0x40.
>
>
> Assuming a 2s complement representation, which we're nearly set to be able
> to assume in C++20 (http://wg21.link/p0907)!
>
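
The trail-byte test quoted above can be sketched as follows; the function
names are made up for illustration. UTF-8 trail bytes are 0x80..0xBF, which
read as -128..-65 when reinterpreted as a two's-complement int8_t, so a
single signed comparison against -0x40 (-64) covers the whole range.

```cpp
#include <cstdint>

// Reference check on the unsigned value: top two bits must be 10.
constexpr bool is_trail_masked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}

// Signed check: one comparison, no mask. The uint8_t -> int8_t conversion
// is implementation-defined before C++20 and assumes two's complement.
constexpr bool is_trail_signed(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}
```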

Well, this is nice! Especially

    *Change* Right-shift is an arithmetic right shift which performs
    sign-extension.

which should get static-analysis tools off our backs.

Only because those have complained about code where we use arithmetic right
shifts did I have to make a macro that does the normal (signed>>num_bits)
on normal compilers, and a manual sign extension when compiling for static
analysis...
I don't think it's been an issue on any real compiler. All machines that
anyone ever ported ICU to seem to use two's-complement integers of
8/16/32/... bits.
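
A rough sketch of that kind of macro, not the actual ICU code. The names
SKETCH_STATIC_ANALYSIS, SKETCH_SHIFT_RIGHT and shift_right_sign_extend are
hypothetical.

```cpp
#include <cstdint>

// Manual sign extension using only unsigned shifts. Valid for 0 <= n <= 31.
// For negative x, shifting the complement and complementing again fills the
// vacated high bits with ones. The final conversion back to int32_t assumes
// two's complement (implementation-defined before C++20).
constexpr std::int32_t shift_right_sign_extend(std::int32_t x, int n) {
    return x >= 0
        ? static_cast<std::int32_t>(static_cast<std::uint32_t>(x) >> n)
        : static_cast<std::int32_t>(~(~static_cast<std::uint32_t>(x) >> n));
}

#if defined(SKETCH_STATIC_ANALYSIS)   // hypothetical flag for analysis builds
#define SKETCH_SHIFT_RIGHT(x, n) shift_right_sign_extend((x), (n))
#else
#define SKETCH_SHIFT_RIGHT(x, n) ((x) >> (n))
#endif
```

For example, shift_right_sign_extend(-8, 1) yields -4, the same result an
arithmetic right shift gives.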

markus

Received on 2018-10-16 23:58:48