On 10/16/2018 05:58 PM, Markus Scherer wrote:
On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom@honermann.net> wrote:
The C standard defines a (very) few functions in terms of the C char16_t typedef (mbrtoc16, c16rtomb).  Within C++, those functions are exposed in the std namespace as though they were declared with the C++ builtin char16_t type.  Has there been much consideration for similarly exposing ICU's C APIs to C++ consumers?

C++ code calls ICU C APIs all the time.

Of course, sorry, I wasn't very clear with that question.  Let me try again.  I was responding to this quote:

> Unfortunately, if UChar is configured != char16_t, you need casts or cast helpers for using C APIs from C++ code.

The question is, effectively, whether consideration has been given to providing cast helpers in a manner similar to how standard C++ provides access to standard C functions; e.g., by exposing cast helpers in a C++ namespace.  More concretely, whether something like the following has been considered:
U_STABLE UChar * U_EXPORT2
u_strchr(const UChar *s, UChar c);

#if defined(__cplusplus)
namespace icu {
  char16_t * U_EXPORT2
  u_strchr(const char16_t *s, char16_t c);
}  // namespace icu
#endif /* __cplusplus */
Note that at least some platforms offer ways to avoid actually writing a definition for the namespace-scoped signature when the functions have compatible calling conventions.
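
For illustration, a minimal sketch of the cast-helper idea, assuming a C API whose code unit type is a 16-bit integer typedef rather than char16_t. The names LegacyUChar, legacy_strchr, and the demo namespace are hypothetical stand-ins, not ICU declarations, and the wrapper is written out as an inline forwarding function rather than relying on any platform-specific aliasing of symbols.

```cpp
#include <cstdint>

// Hypothetical stand-in for a C API configured with a non-char16_t code unit.
typedef std::uint16_t LegacyUChar;

// Hypothetical C function: find the first occurrence of c in the
// NUL-terminated string s, or return nullptr if there is none.
extern "C" LegacyUChar *legacy_strchr(const LegacyUChar *s, LegacyUChar c) {
    for (;; ++s) {
        if (*s == c) return const_cast<LegacyUChar *>(s);
        if (*s == 0) return nullptr;
    }
}

namespace demo {

// Namespace-scoped signature using the C++ builtin char16_t. The casts live
// here so that call sites do not need them.
// Caveat: reading char16_t storage through a uint16_t pointer is not blessed
// by the strict aliasing rules; a production helper may need extra care.
inline char16_t *legacy_strchr(const char16_t *s, char16_t c) {
    return reinterpret_cast<char16_t *>(::legacy_strchr(
        reinterpret_cast<const LegacyUChar *>(s),
        static_cast<LegacyUChar>(c)));
}

}  // namespace demo
```

With this in place, a call such as demo::legacy_strchr(u"hello", u'l') compiles without any cast at the call site.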

People use C APIs because they can be binary stable, and they want to be able to link with multiple versions of the ICU DLL.

Indeed.


People who call C++ APIs either tightly control DLL versions or link everything statically.

Despite not wanting to...


It would be really nice if it were feasible to provide a stable C++ API from a shared library.

but having to because of this :)


(This technique is not without complexities.  For example, attempting to take the address of an overloaded function without a cast may be ambiguous.  I'm just curious how much this or similar techniques were explored and what the conclusions were)

Not sure what the question is.
There is of course no overloading on C APIs.

Hopefully I've clarified this above.


If u"literals" had just been uint16_t* without a new type, then we could have used string literals without changing API and breaking call sites, on most platforms anyway. And if uint16_t==wchar_t on Windows, then that would have been fine, too.

How would that have been fine on Windows?  The reinterpret casts would still have been required.

Why? If the two types had been typedefs of each other, there would not need to be any casts.

I overlooked your mention of uint16_t==wchar_t.  However, uint16_t was added in C99 and I suspect it would have already been too late to define it as wchar_t when u"literals" were adopted.  Additionally, that would have resulted in the same problems that we now face with int8_t commonly being defined in terms of a character type.


Lyberta provided one example, but there are others.  For example, serialization and logging libraries.  Consider a modern JSON library; it is convenient to be able to write code like the following that just works.

json_object player;
uint16_t scores[] = { 16, 27, 13 };
player["id"] = 42;
player["name"] = std::u16string("Skipper McGoof");
player["nickname"] = u"Goofy"; // stores a string
player["scores"] = scores;     // stores an array of numbers.

Note that the above works because uint16_t is effectively never defined in terms of a character type.
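
A rough sketch of why the distinct types matter for such a library: overload resolution can pick a string constructor for char16_t data and an array constructor for uint16_t data. The demo_value type below is hypothetical and far simpler than any real JSON library; it only records which constructor was chosen.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical value type, not the API of any real JSON library.
struct demo_value {
    enum kind_t { string_kind, number_array_kind };

    kind_t kind;
    std::u16string text;   // used when kind == string_kind
    std::size_t count;     // element count when kind == number_array_kind

    // Chosen for u"..." literals and other char16_t strings.
    demo_value(const char16_t *s) : kind(string_kind), text(s), count(0) {}

    // Chosen for arrays of 16-bit numbers. This is only distinguishable from
    // the constructor above because uint16_t and char16_t are distinct types.
    template <std::size_t N>
    demo_value(const std::uint16_t (&)[N])
        : kind(number_array_kind), count(N) {}
};
```

If uint16_t were a typedef of char16_t, the array constructor would also match string literals and the two cases could no longer be told apart by type alone.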

Sure, but that feels like cherry-picking: You introduce one new type for one specific kind of thing (a pointer to certain units holding a string), but every other data that's a vector of essentially the same base units is still not distinguishable -- you wouldn't be able to distinguish scores from coordinates from other lists of numbers etc.

That is a fair criticism.  The trend is to improve the ability to distinguish such unit kinds.  We see this in the C++20 std::chrono library and other libraries like https://github.com/nholthaus/units.  C++11 user-defined literals (despite some usability issues) are intended to help in this respect.  Where we have core language features (e.g., string literals), I think it is reasonable to be able to differentiate them without having to further decorate them.

Tom.


Having different types for character data makes the above possible without having to hard-code for specific string types.  In the concepts enabled world that we are moving into, this enables us to write concepts like the following that can then be used to constrain functions intended to work only on string-like types.

I take your word for it. I know nothing about "concepts".

In ICU, when I get to actual UTF-8 processing, I tend to either cast each byte to uint8_t or cast the whole pointer to uint8_t* and call an internal worker function.
Somewhat ironically, the fastest way to test for a UTF-8 trail byte is via the opposite cast, testing if (int8_t)b<-0x40.
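
A small sketch of that trail-byte test next to the more familiar mask-and-compare form. The helper names are hypothetical; only the (int8_t)b < -0x40 comparison comes from the text above. Trail bytes are 0x80..0xBF, which as signed 8-bit values are -128..-65, so a single signed comparison against -0x40 (-64) covers the whole range on two's-complement targets.

```cpp
#include <cstdint>

// Signed-comparison form: true for 0x80..0xBF.
// Assumes the conversion to int8_t wraps (two's complement).
constexpr bool is_utf8_trail_signed(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Conventional form for comparison: top two bits are 10.
constexpr bool is_utf8_trail_masked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}

// Boundary checks: 0x7F is ASCII, 0x80 and 0xBF are trail bytes,
// 0xC0 is already in the lead-byte range.
static_assert(!is_utf8_trail_signed(0x7F), "ASCII is not a trail byte");
static_assert(is_utf8_trail_signed(0x80), "lowest trail byte");
static_assert(is_utf8_trail_signed(0xBF), "highest trail byte");
static_assert(!is_utf8_trail_signed(0xC0), "lead-byte range");
```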

Assuming a two's complement representation, which we're nearly set to be able to assume in C++20 (http://wg21.link/p0907)!

Well, this is nice! Especially
> Change: Right-shift is an arithmetic right shift which performs sign-extension.
which should get static-analysis tools off our backs.

Only because those tools complained about code where we use arithmetic right shifts did I have to make a macro that does the normal (signed>>num_bits) on normal compilers, and a manual sign extension when compiling for static analysis...
I don't think it's been an issue on any real compiler. All machines that anyone ever ported ICU to seem to use two's-complement integers of 8/16/32/... bits.
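
As an illustration only, one way such a manual sign extension could be written without ever right-shifting a negative value. This is a sketch under stated assumptions, not ICU's actual macro; the function name is hypothetical and n is assumed to satisfy 0 <= n < 32.

```cpp
#include <cstdint>

// Hypothetical helper: arithmetic (sign-extending) right shift that only
// ever applies >> to a non-negative operand.
// For negative x, ~x is non-negative; shifting it and inverting the result
// yields floor(x / 2^n), which is what an arithmetic shift produces.
constexpr std::int32_t arithmetic_shift_right(std::int32_t x, int n) {
    return x < 0 ? ~(~x >> n) : (x >> n);
}

static_assert(arithmetic_shift_right(-8, 1) == -4, "exact division");
static_assert(arithmetic_shift_right(-7, 1) == -4, "rounds toward -infinity");
static_assert(arithmetic_shift_right(-1, 3) == -1, "sign bits fill in");
static_assert(arithmetic_shift_right(20, 2) == 5, "non-negative unchanged");
```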

markus