sg16: Re: [SG16-Unicode] Draft SG16 direction paper

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 17 Oct 2018 00:16:38 -0400

On 10/16/2018 05:58 PM, Markus Scherer wrote:
> On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> The C standard defines a (very) few functions in terms of the C
> char16_t typedef (mbrtoc16, c16rtomb). Within C++, those functions
> are exposed in the std namespace as though they were declared with
> the C++ builtin char16_t type. Has there been much consideration
> for similarly exposing ICU's C APIs to C++ consumers?
>
>
> C++ code calls ICU C APIs all the time.

Of course, sorry, I wasn't very clear with that question. Let me try
again. I was responding to this quote:

> Unfortunately, if UChar is configured != char16_t, you need casts or
> cast helpers for using C APIs from C++ code.

The question is, effectively, whether consideration has been given to
providing cast helpers in a manner similar to how standard C++ provides
access to standard C functions; e.g., by exposing cast helpers in a C++
namespace. More concretely, whether something like the following has
been considered:

    U_STABLE UChar * U_EXPORT2
    u_strchr(const UChar *s, UChar c);

    #if defined(__cplusplus)
    namespace icu {
       char16_t * U_EXPORT2
       u_strchr(const char16_t *s, char16_t c);
    }
    #endif /* __cplusplus */

Note that, on at least some platforms, there are ways to avoid writing a
separate definition for the namespace-scoped signature when the
functions have compatible calling conventions.
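
For comparison, the portable fallback is a thin inline forwarding overload.
The sketch below is self-contained and does not use ICU: UChar,
demo_u_strchr, and namespace demo are hypothetical stand-ins, and it assumes
UChar is configured as uint16_t with the same size and representation as
char16_t. Strictly speaking, the pointer casts step around the language's
aliasing rules, so real code would need to account for that.

```cpp
#include <cstdint>

// Hypothetical stand-in for a C API declared in terms of UChar.
// Assumption: UChar is configured as uint16_t rather than char16_t.
typedef std::uint16_t UChar;

extern "C" UChar* demo_u_strchr(const UChar* s, UChar c) {
    for (;; ++s) {
        if (*s == c) return const_cast<UChar*>(s);  // also matches the terminator when c == 0
        if (*s == 0) return nullptr;                // reached the end without a match
    }
}

static_assert(sizeof(UChar) == sizeof(char16_t),
              "the cast helper assumes identical code unit sizes");

namespace demo {
// Namespace-scoped overload in terms of char16_t; it only casts and forwards.
inline char16_t* u_strchr(const char16_t* s, char16_t c) {
    return reinterpret_cast<char16_t*>(
        demo_u_strchr(reinterpret_cast<const UChar*>(s), static_cast<UChar>(c)));
}
}  // namespace demo
```

A call such as demo::u_strchr(u"hello", u'l') then needs no cast at the call
site; the casts are confined to the wrapper.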

> People use C APIs because they can be binary stable, and they want to
> be able to link with multiple versions of the ICU DLL.

Indeed.

>
> People who call C++ APIs either tightly control DLL versions or link
> everything statically.

Despite not wanting to...

>
> It would be really nice if it was feasible to provide stable C++ API
> from a shared library.

but having to because of this :)

>
> (This technique is not without complexities. For example,
> attempting to take the address of an overloaded function without a
> cast may be ambiguous. I'm just curious how much this or similar
> techniques were explored and what the conclusions were)
>
>
> Not sure what the question is.
> There is of course no overloading on C APIs.

Hopefully I've clarified this above.
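
To illustrate the address-of complexity mentioned in the quoted text, here
is a minimal sketch. It applies only to C++ overloads of the kind proposed
above, not to the C APIs themselves; unit_kind and namespace demo are
hypothetical names chosen for the example.

```cpp
#include <cstdint>

namespace demo {
// Two hypothetical overloads that differ only in the code unit type.
inline int unit_kind(const std::uint16_t*) { return 1; }
inline int unit_kind(const char16_t*) { return 2; }
}  // namespace demo

// auto p = &demo::unit_kind;  // ill-formed: nothing says which overload is meant

// A typed target selects the overload...
int (*const from_target)(const char16_t*) = &demo::unit_kind;

// ...and so does an explicit cast.
const auto from_cast =
    static_cast<int (*)(const std::uint16_t*)>(&demo::unit_kind);
```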

>
>> If u"literals" had just been uint16_t* without a new type, then
>> we could have used string literals without changing API and
>> breaking call sites, on most platforms anyway. And if
>> uint16_t==wchar_t on Windows, then that would have been fine, too.
>
> How would that have been fine on Windows? The reinterpret casts
> would still have been required.
>
>
> Why? If the two types had been typedefs of each other, there would
> not need to be any casts.

I overlooked your mention of uint16_t==wchar_t. However, uint16_t was
added in C99 and I suspect it would have already been too late to define
it as wchar_t when u"literals" were adopted. Additionally, that would
have resulted in the same problems that we now face with int8_t commonly
being defined in terms of a character type.
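
A small sketch of the int8_t problem, assuming a platform where int8_t is a
typedef for signed char (which is typical but not guaranteed). The helper
name streamed is made up for the example.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formats a value with operator<<, the way a logging library might.
template <typename T>
std::string streamed(T value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// streamed(std::uint16_t{65}) selects the numeric overload of operator<<.
// streamed(std::int8_t{65}) selects the character overload wherever int8_t
// is signed char, so the value is written as a character, not as a number.
```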

>
> Lyberta provided one example, but there are others. For example,
> serialization and logging libraries. Consider a modern JSON
> library; it is convenient to be able to write code like the
> following that just works.
>
> json_object player;
> uint16_t scores[] = { 16, 27, 13 };
> player["id"] = 42;
> player["name"] = std::u16string(u"Skipper McGoof");
> player["nickname"] = u"Goofy"; // stores a string
> player["scores"] = scores; // stores an array of numbers.
>
> Note that the above works because uint16_t is effectively never
> defined in terms of a character type.
>
>
> Sure, but that feels like cherry-picking: You introduce one new type
> for one specific kind of thing (a pointer to certain units holding a
> string), but every other data that's a vector of essentially the same
> base units is still not distinguishable -- you wouldn't be able to
> distinguish scores from coordinates from other lists of numbers etc.

That is a fair criticism. The trend is to improve the ability to
distinguish such unit kinds. We see this in the C++20 std::chrono
library and other libraries like https://github.com/nholthaus/units.
C++11 user-defined literals (despite some usability issues) are intended
to help in this respect. Where we have core language features (e.g.,
string literals), I think it is reasonable to be able to differentiate
them without having to further decorate them.
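
As a rough sketch of why the quoted JSON snippet can "just work", the
overloads below tell a u"literal" apart from an array of uint16_t purely by
type. Kind and classify are hypothetical and not taken from any real JSON
library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical classification of what a JSON-like library would store.
enum class Kind { String, NumberArray };

// Character data: UTF-16 literals and std::u16string.
inline Kind classify(const char16_t*) { return Kind::String; }
inline Kind classify(const std::u16string&) { return Kind::String; }

// Numeric data: a built-in array of uint16_t.
// This only stays unambiguous because uint16_t is not char16_t.
template <std::size_t N>
Kind classify(const std::uint16_t (&)[N]) { return Kind::NumberArray; }
```

If u"literals" had the type const uint16_t[N], the literal and the scores
array would reach the same overload and the library could not distinguish
them without extra decoration.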

Tom.

>
> Having different types for character data makes the above possible
> without having to hard-code for specific string types. In the
> concepts enabled world that we are moving into, this enables us to
> write concepts like the following that can then be used to
> constrain functions intended to work only on string-like types.
>
>
> I take your word for it. I know nothing about "concepts".
>
>> In ICU, when I get to actual UTF-8 processing, I tend to either
>> cast each byte to uint8_t or cast the whole pointer to uint8_t*
>> and call an internal worker function.
>> Somewhat ironically, the fastest way to test for a UTF-8 trail
>> byte is via the opposite cast, testing if (int8_t)b<-0x40.
>
> Assuming a two's complement representation, which we're nearly set to
> be able to assume in C++20 (http://wg21.link/p0907)!
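
For reference, a minimal sketch of the signed trail-byte test quoted above
next to the more familiar mask-and-compare form. The function names are made
up for the example, and the cast assumes two's complement conversion, which
is implementation-defined before C++20.

```cpp
#include <cstdint>

// UTF-8 trail bytes are 0x80..0xBF, which is -128..-65 when viewed as int8_t,
// so a single signed comparison against -0x40 (-64) covers the whole range.
inline bool is_utf8_trail(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Mask-and-compare form of the same test, for comparison.
inline bool is_utf8_trail_masked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```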
>
>
> Well, this is nice! Especially
>
> /Change/ Right-shift is an arithmetic right shift which performs
> sign-extension.
>
> which should get static-analysis tools off our backs.
>
> Only because those have complained about code where we use arithmetic
> right shifts did I have to make a macro that does the normal
> (signed>>num_bits) on normal compilers, and a manual sign extension
> when compiling for static analysis...
> I don't think it's been an issue on any real compiler. All machines
> that anyone ever ported ICU to seem to use two's-complement integers
> of 8/16/32/... bits.
>
> markus
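
The macro described above is not shown in the thread. One common way to
write a manual sign-extending shift that never right-shifts a negative value
is sketched below; the function name is hypothetical and this is not ICU's
macro.

```cpp
#include <cstdint>

// Arithmetic right shift without ever right-shifting a negative value.
// Precondition: 0 <= bits < 32.
inline std::int32_t shift_right_arithmetic(std::int32_t value, int bits) {
    // For negative input, ~value is non-negative, so its shift is fully
    // defined; complementing again restores the sign-extended result.
    return value >= 0 ? (value >> bits) : ~(~value >> bits);
}
```

The result rounds toward negative infinity, matching what an arithmetic
shift produces on two's-complement hardware.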

Received on 2018-10-17 06:16:42