Re: [SG16-Unicode] Draft SG16 direction paper

From: Markus Scherer <markus.icu_at_[hidden]>
Date: Tue, 16 Oct 2018 14:58:33 -0700
On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom_at_[hidden]> wrote:

> The C standard defines a (very) few functions in terms of the C char16_t
> typedef (mbrtoc16, c16rtomb). Within C++, those functions are exposed in
> the std namespace as though they were declared with the C++ builtin
> char16_t type. Has there been much consideration for similarly exposing
> ICU's C APIs to C++ consumers?

C++ code calls ICU C APIs all the time.
People use C APIs because they can be binary stable, and they want to be
able to link with multiple versions of the ICU DLL.

People who call C++ APIs either tightly control DLL versions or link
everything statically.

It would be really nice if it were feasible to provide a stable C++ API from a
shared library.

> (This technique is not without complexities. For example, attempting to
> take the address of an overloaded function without a cast may be
> ambiguous. I'm just curious how much this or similar techniques were
> explored and what the conclusions were)

Not sure what the question is.
There is of course no overloading on C APIs.
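
For readers unfamiliar with the ambiguity mentioned in the quoted question, here is a minimal sketch of how it arises once an overload set exists on the C++ side. The function name `kind` and both signatures are hypothetical, not part of ICU or the standard library.

```cpp
#include <cstdint>

// Hypothetical overload set: one signature per 16-bit unit type.
constexpr int kind(const std::uint16_t*) { return 0; }  // integer units
constexpr int kind(const char16_t*)      { return 1; }  // character units

// auto p = &kind;  // error: the compiler cannot tell which overload is meant

// Naming the target type selects an overload...
using TextFn = int (*)(const char16_t*);
constexpr TextFn text_fn = &kind;

// ...and so does an explicit cast.
constexpr auto number_fn = static_cast<int (*)(const std::uint16_t*)>(&kind);
```

A plain C API has exactly one function per name, so `&name` is never ambiguous there.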

>> If u"literals" had just been uint16_t* without a new type, then we could
>> have used string literals without changing API and breaking call sites, on
>> most platforms anyway. And if uint16_t==wchar_t on Windows, then that would
>> have been fine, too.
>
> How would that have been fine on Windows? The reinterpret casts would
> still have been required.

Why? If the two types had been typedefs of each other, there would be no
need for any casts.
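
A minimal sketch of the difference, assuming a hypothetical alias `MyUnit` and a hypothetical helper `as_units`: a typedef converts implicitly because it is the same type, while the distinct built-in `char16_t` needs an explicit pointer cast.

```cpp
#include <cstdint>
#include <type_traits>

// A typedef is only another name for the same type.
typedef std::uint16_t MyUnit;  // hypothetical alias
static_assert(std::is_same<MyUnit, std::uint16_t>::value, "same type");

// char16_t has the same size but is a distinct built-in type.
static_assert(!std::is_same<char16_t, std::uint16_t>::value, "distinct type");

// No cast needed: MyUnit* already is uint16_t*.
inline const std::uint16_t* as_units(const MyUnit* p) { return p; }

// Explicit cast needed: there is no implicit conversion between the
// two pointer types.
inline const std::uint16_t* as_units(const char16_t* p) {
    return reinterpret_cast<const std::uint16_t*>(p);
}
```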

> Lyberta provided one example, but there are others. For example,
> serialization and logging libraries. Consider a modern JSON library; it is
> convenient to be able to write code like the following that just works.
> json_object player;
> uint16_t scores[] = { 16, 27, 13 };
> player["id"] = 42;
> player["name"] = std::u16string("Skipper McGoof");
> player["nickname"] = u"Goofy"; // stores a string
> player["scores"] = scores; // stores an array of numbers.
> Note that the above works because uint16_t is effectively never defined
> in terms of a character type.

Sure, but that feels like cherry-picking: You introduce one new type for
one specific kind of thing (a pointer to certain units holding a string),
but every other kind of data that's a vector of essentially the same base units is
still not distinguishable -- you wouldn't be able to distinguish scores
from coordinates from other lists of numbers etc.
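
To make both sides of this point concrete, here is a minimal sketch with hypothetical names (`Kind`, `classify`), not taken from any JSON library. Overloading on the element type separates text from numbers, but two arrays of `uint16_t` still look identical to the compiler.

```cpp
#include <cstddef>
#include <cstdint>

enum class Kind { Text, Numbers };  // hypothetical tags for this sketch

// Selected for u"..." literals and other char16_t arrays.
template <std::size_t N>
constexpr Kind classify(const char16_t (&)[N]) { return Kind::Text; }

// Selected for every uint16_t array: scores, coordinates, or anything
// else built from the same base unit.
template <std::size_t N>
constexpr Kind classify(const std::uint16_t (&)[N]) { return Kind::Numbers; }
```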

> Having different types for character data makes the above possible without
> having to hard-code for specific string types. In the concepts enabled
> world that we are moving into, this enables us to write concepts like the
> following that can then be used to constrain functions intended to work
> only on string-like types.

I take your word for it. I know nothing about "concepts".
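
For context, the idea of constraining a function to character types can be approximated without concepts. The following is only a sketch using C++11 type traits. It is not the concept from the quoted message, and the names `is_code_unit` and `is_empty_text` are hypothetical.

```cpp
#include <type_traits>

// Hypothetical trait: true only for the built-in character types.
template <typename T> struct is_code_unit : std::false_type {};
template <> struct is_code_unit<char>     : std::true_type {};
template <> struct is_code_unit<wchar_t>  : std::true_type {};
template <> struct is_code_unit<char16_t> : std::true_type {};
template <> struct is_code_unit<char32_t> : std::true_type {};

// Participates in overload resolution only for pointers to character
// types; a pointer to a plain 16-bit integer does not match.
template <typename T,
          typename = typename std::enable_if<is_code_unit<T>::value>::type>
constexpr bool is_empty_text(const T* s) { return *s == T(); }
```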

>> In ICU, when I get to actual UTF-8 processing, I tend to either cast each
>> byte to uint8_t or cast the whole pointer to uint8_t* and call an internal
>> worker function.
>> Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via
>> the opposite cast, testing if (int8_t)b<-0x40.
>
> Assuming a 2s complement representation, which we're nearly set to be able
> to assume in C++20 (http://wg21.link/p0907)!
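
A minimal sketch of the trail-byte test described above, next to the conventional mask test. The function names are hypothetical and this is not ICU's code. It assumes that converting an out-of-range value to `int8_t` wraps as in two's complement, which is implementation-defined before C++20.

```cpp
#include <cstdint>

// Conventional test: trail bytes have the bit pattern 10xxxxxx.
constexpr bool is_trail_mask(std::uint8_t b) { return (b & 0xC0) == 0x80; }

// Signed test: as int8_t, bytes 0x80..0xBF become -128..-65, which are
// exactly the values below -0x40 (-64).
constexpr bool is_trail_signed(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Compile-time check that both tests agree for all 256 byte values.
constexpr bool trail_tests_agree(int b) {
    return b > 0xFF ||
           (is_trail_mask(static_cast<std::uint8_t>(b)) ==
                is_trail_signed(static_cast<std::uint8_t>(b)) &&
            trail_tests_agree(b + 1));
}
static_assert(trail_tests_agree(0), "mask and signed tests agree");
```

The signed form needs one comparison instead of a mask plus a comparison.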

Well, this is nice! Especially

*Change* Right-shift is an arithmetic right shift which performs

which should get static-analysis tools off our backs.

Only because those have complained about code where we use arithmetic right
shifts did I have to make a macro that does the normal (signed>>num_bits)
on normal compilers, and a manual sign extension when compiling for static
analysis.

I don't think it's been an issue on any real compiler. All machines that
anyone ever ported ICU to seem to use two's-complement integers of
8/16/32/... bits.
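
As an illustration of the two variants, here is a minimal sketch with hypothetical function names. It is not ICU's actual macro. It assumes a 32-bit operand and a shift count from 0 to 31.

```cpp
#include <cstdint>

// Plain variant: relies on >> of a negative value being an arithmetic
// shift (implementation-defined before C++20).
constexpr std::int32_t sar_plain(std::int32_t x, int n) { return x >> n; }

// Manual variant: shift the bits as unsigned, then fill the vacated
// high bits with copies of the sign bit.
constexpr std::int32_t sar_manual(std::int32_t x, int n) {
    return static_cast<std::int32_t>(
        (static_cast<std::uint32_t>(x) >> n) |
        (x < 0 ? ~(~std::uint32_t(0) >> n) : std::uint32_t(0)));
}
```

On a two's-complement machine with an arithmetic right shift, both functions return the same result for every valid input.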


Received on 2018-10-16 23:58:48