On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom@honermann.net> wrote:

The C standard defines a (very) few functions in terms of the C char16_t typedef (mbrtoc16, c16rtomb). Within C++, those functions are exposed in the std namespace as though they were declared with the C++ builtin char16_t type. Has there been much consideration for similarly exposing ICU's C APIs to C++ consumers?

C++ code calls ICU C APIs all the time.

People use C APIs because they can be binary stable, and they want to be able to link with multiple versions of the ICU DLL.

People who call C++ APIs either tightly control DLL versions or link everything statically.

It would be really nice if it were feasible to provide a stable C++ API from a shared library.

(This technique is not without complexities. For example, attempting to take the address of an overloaded function without a cast may be ambiguous. I'm just curious how much this or similar techniques were explored and what the conclusions were)
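The ambiguity mentioned above can be sketched roughly as follows. The name `unit_length` and both overloads are hypothetical and are not ICU API; the sketch only assumes a C-style function taking uint16_t units with a char16_t overload added next to it for C++ callers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical overload pair (not ICU API): one function for uint16_t units
// and one C++ convenience overload for char16_t units.
inline std::size_t unit_length(const std::uint16_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;  // count units up to the terminating zero
    return n;
}

inline std::size_t unit_length(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;  // same loop, different unit type
    return n;
}

// auto p = &unit_length;  // ill-formed: no target type to select an overload

// A cast (or a typed variable) names the overload that is wanted.
using Char16Fn = std::size_t (*)(const char16_t*);

inline Char16Fn pick_char16_overload() {
    return static_cast<Char16Fn>(&unit_length);
}
```

Callers that only ever call the function are unaffected; the cast is needed only where the function's address is taken.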

Not sure what the question is.

There is of course no overloading on C APIs.

If u"literals" had just been uint16_t* without a new type, then we could have used string literals without changing API and breaking call sites, on most platforms anyway. And if uint16_t==wchar_t on Windows, then that would have been fine, too.

How would that have been fine on Windows? The reinterpret casts would still have been required.

Why? If the two types had been typedefs of each other, there would not need to be any casts.
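A small sketch of the difference, under the assumption that uint16_t exists on the platform. The names `unit_alias` and `as_uint16` are made up for illustration. A typedef converts implicitly because it is the same type; char16_t is a distinct type in C++, so a pointer to it needs an explicit cast.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical typedef standing in for "the two types are typedefs of each other".
using unit_alias = std::uint16_t;

// Same type under another name: pointers convert implicitly, no cast needed.
static_assert(std::is_same<unit_alias, std::uint16_t>::value,
              "a typedef does not create a new type");
static_assert(std::is_convertible<const unit_alias*, const std::uint16_t*>::value,
              "pointer to typedef converts implicitly");

// char16_t is a distinct built-in type, so there is no implicit pointer conversion.
static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t is never the same type as uint16_t");
static_assert(!std::is_convertible<const char16_t*, const std::uint16_t*>::value,
              "pointer to char16_t does not convert implicitly");

// With distinct types the call site needs an explicit cast. This sketch only
// shows that the cast is required to compile; it makes no claim about aliasing rules.
inline const std::uint16_t* as_uint16(const char16_t* s) {
    return reinterpret_cast<const std::uint16_t*>(s);
}
```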

Lyberta provided one example, but there are others. For example, serialization and logging libraries. Consider a modern JSON library; it is convenient to be able to write code like the following that just works.

json_object player;
uint16_t scores[] = { 16, 27, 13 };
player["id"] = 42;
player["name"] = std::u16string(u"Skipper McGoof");
player["nickname"] = u"Goofy"; // stores a string
player["scores"] = scores; // stores an array of numbers
Note that the above works because uint16_t is effectively never defined in terms of a character type.
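How a library could tell the two assignments apart can be sketched with a plain overload set. The function `classify` is hypothetical and does not come from any real JSON library; it only shows that overload resolution can separate char16_t text from uint16_t numbers because the types are distinct.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: chosen for char16_t strings such as u"Goofy".
inline std::string classify(const char16_t*) {
    return "string";
}

// Hypothetical helper: chosen for arrays of uint16_t such as scores.
template <std::size_t N>
std::string classify(const std::uint16_t (&)[N]) {
    return "array of " + std::to_string(N) + " numbers";
}
```

If char16_t were only a typedef of uint16_t, both arguments would have the same element type and the library could not choose by type alone.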

Sure, but that feels like cherry-picking: You introduce one new type for one specific kind of thing (a pointer to certain units holding a string), but every other kind of data that's a vector of essentially the same base units is still not distinguishable -- you wouldn't be able to distinguish scores from coordinates from other lists of numbers etc.

Having different types for character data makes the above possible without having to hard-code for specific string types. In the concepts enabled world that we are moving into, this enables us to write concepts like the following that can then be used to constrain functions intended to work only on string-like types.

I take your word for it. I know nothing about "concepts".

In ICU, when I get to actual UTF-8 processing, I tend to either cast each byte to uint8_t or cast the whole pointer to uint8_t* and call an internal worker function.

Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via the opposite cast, testing if (int8_t)b<-0x40.
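As a sketch of that test (the function names are mine, not ICU's): trail bytes are 0x80..0xBF, which as signed 8-bit values are -128..-65, so a single signed comparison against -0x40 covers the range. The second function is the more familiar mask-and-compare form, for comparison.

```cpp
#include <cstdint>

// Signed-compare form: 0x80..0xBF reinterpreted as int8_t is -128..-65,
// and -0x40 is -64. Assumes two's complement; before C++20 the result of
// the narrowing conversion is implementation-defined.
inline bool is_utf8_trail(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Mask-and-compare form: the top two bits of a trail byte are 10.
inline bool is_utf8_trail_masked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```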

Assuming a two's complement representation, which we're nearly set to be able to assume in C++20 (http://wg21.link/p0907)!

Well, this is nice! Especially

Change Right-shift is an arithmetic right shift which performs sign-extension.

which should get static-analysis tools off our backs.

Only because those tools have complained about code where we use arithmetic right shifts did I have to make a macro that does the normal (signed>>num_bits) on normal compilers, and a manual sign extension when compiling for static analysis...
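Roughly, such a workaround might look like the sketch below. This is not the actual ICU macro; the function names and the `SKETCH_STATIC_ANALYSIS` switch are invented for illustration. The manual path never shifts a negative value: it complements a negative input, shifts the resulting non-negative number, and complements back.

```cpp
#include <cstdint>

// Manual sign extension: for negative x, ~x is non-negative, so the shift
// is well-defined everywhere; complementing again restores the sign bits.
inline std::int32_t shift_right_manual(std::int32_t x, int n) {
    return x < 0 ? ~(~x >> n) : (x >> n);
}

// Hypothetical switch: SKETCH_STATIC_ANALYSIS is an invented macro name
// standing in for "this build is being run through a static analyzer".
inline std::int32_t shift_right_arithmetic(std::int32_t x, int n) {
#if defined(SKETCH_STATIC_ANALYSIS)
    return shift_right_manual(x, n);
#else
    return x >> n;  // arithmetic shift on the usual two's-complement compilers
#endif
}
```

Both paths round toward negative infinity, so for example -7 shifted right by 1 gives -4 rather than -3.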

I don't think it's been an issue on any real compiler. All machines that anyone ever ported ICU to seem to use two's-complement integers of 8/16/32/... bits.

markus