On 10/08/2018 12:38 PM, Markus Scherer wrote:
supports customization of its internal code unit type, but
char16_t is used by default, following ICU's adoption of C++11.
Not quite... ICU supports customization of its code unit
type for C APIs. Internally, and in C++ APIs,
we switched to char16_t. And because that broke call sites, we
mitigated where we could with overloads and shim classes.
Ah, thank you for the correction. If we end up submitting a
revision of the paper, I'll include this correction. I had checked
the ICU sources (include/unicode/umachine.h) and verified
that the UChar typedef was configurable, but I didn't
realize that configuration was limited to C code.
This was all quite painful.
I believe that. I discovered the U_ALIASING_BARRIER macro, which is
used to work around the fact that, for example, a reinterpret_cast<const
wchar_t*> from a pointer to char16_t results in
undefined behavior. The need for such heroics is a bit more limited
for char8_t since char and unsigned char
are allowed to alias with char8_t (though not the other way around).
It would be interesting to get more perspective on how and why ICU
evolved like it did. What was the motivation for ICU to switch to char16_t?
Were the anticipated benefits realized despite the perhaps
unanticipated complexities? If Windows were to suddenly sprout
Win32 interfaces defined in terms of char16_t, would the
pain be substantially relieved? Are code bases that use ICU on
non-Windows platforms (slowly) migrating from uint16_t to char16_t?
As for char8_t, I realize that you think the benefits
outweigh the costs.
I asked some C++ experts about the potential for
performance gains from better optimizations; one responded
with a skeptical note.
This is something I would like to get more data on. I've looked and
I've asked, but so far haven't found any research that attempts to
quantify the optimization loss caused by char being allowed to alias any object.
I've heard claims that it is significant, but have not seen data to
support such claims. The benefits of TBAA in general are not
disputed, and it seems reasonable to conclude that there is
therefore a lost opportunity if TBAA cannot be applied fully for char.
But whether that opportunity is large or small I really don't know.
In theory, we could use the current support in gcc and Clang for char8_t
to explore this further.
If you do want a distinct type, why not just standardize on
uint8_t? Why does it need to be a new type that is distinct
from that, too?
Lyberta provided one example; we do need to be able to overload or
specialize on character vs integer types. Since uint8_t
is conditionally supported, we can't rely on its existence within
the standard (we'd have to use unsigned char or uint_least8_t instead).
I think there is value in maintaining consistency with char16_t
and char32_t. char8_t provides the missing
piece needed to enable a clean, type-safe, external vs internal
encoding model that allows use of any of UTF-8, UTF-16, or UTF-32 as
the internal encoding, that is easy to teach, and that facilitates
generic libraries like text_view that work seamlessly with any of them.