Re: [SG16-Unicode] code_unit_sequence

From: Lyberta <lyberta_at_[hidden]>
Date: Wed, 17 Jul 2019 22:11:00 +0000

Steve Downey:
> What interfaces is utf8_code_unit likely to appear in? I'm not sure I see
> the value in a strong type here, whereas I can see it for code_point and
> scalar_value. I expect most conversion operations to translate from untyped
> raw data, most likely byte, char, or char8_t, directly to code_point or
> scalar_value? There's some special cases for utf-8 / 16 conversions, but
> those are still likely to be on parts of raw buffers or in the vicinity of
> OS interfaces. At least that's been my experience.

The strong type is used to enforce stronger invariants. With dumb types
you can shoot yourself in the foot easily:

char8_t cu = 0xC0; // Invalid UTF-8 code unit, yet compiles

char16_t cu1 = 300;
char8_t cu2 = cu1; // Makes no sense, yet compiles

With my proposal:

std::unicode::utf8_code_unit cu{0xC0}; // Compile time error

std::unicode::utf16_code_unit cu1{300};
std::unicode::utf8_code_unit cu2 = cu1; // Compile time error
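
To make the contrast concrete, here is a minimal sketch of what such a strong type could look like. Everything in it is hypothetical: the `sketch` namespace, the class layouts and the helper `is_utf8_code_unit` are assumptions for illustration, not the proposed `std::unicode` interface. It rejects 0xC0, 0xC1 and 0xF5 through 0xFF, which never occur in well-formed UTF-8, and it defines no conversion between the two code unit types. With a plain constexpr constructor, as assumed here, the bad value is a compile-time error only inside a constant expression; elsewhere it throws at run time.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not std::unicode

// Bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8.
constexpr bool is_utf8_code_unit(std::uint32_t v) noexcept {
    return v <= 0xFF && v != 0xC0 && v != 0xC1 && v < 0xF5;
}

class utf8_code_unit {
public:
    // In a constant expression the throw is a compile-time error;
    // otherwise an invalid value throws std::invalid_argument.
    constexpr explicit utf8_code_unit(std::uint32_t v)
        : value_(is_utf8_code_unit(v)
                     ? static_cast<std::uint8_t>(v)
                     : throw std::invalid_argument("not a UTF-8 code unit")) {}
    constexpr std::uint8_t value() const noexcept { return value_; }

private:
    std::uint8_t value_;
};

class utf16_code_unit {
public:
    constexpr explicit utf16_code_unit(std::uint32_t v)
        : value_(v <= 0xFFFF
                     ? static_cast<std::uint16_t>(v)
                     : throw std::invalid_argument("not a UTF-16 code unit")) {}
    constexpr std::uint16_t value() const noexcept { return value_; }

private:
    std::uint16_t value_;
};

// No constructor or conversion operator links the two types.
static_assert(!std::is_convertible<utf16_code_unit, utf8_code_unit>::value, "");
static_assert(!std::is_constructible<utf8_code_unit, utf16_code_unit>::value, "");

inline void demo() {
    constexpr utf8_code_unit a{0x41};       // OK, checked at compile time
    static_assert(a.value() == 0x41, "");
    // constexpr utf8_code_unit bad{0xC0};  // would not compile
    utf16_code_unit cu1{300};
    // utf8_code_unit cu2 = cu1;            // would not compile: no conversion
    assert(cu1.value() == 300);
}

}  // namespace sketch
```

In C++20 the constructor could be declared consteval instead, which would make every invalid literal a hard error, at the cost of needing a separate checked path for run-time data.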

Modern C++ is all about strong types. std::chrono doesn't use dumb types
because that would be a disaster.
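
As an illustration of the std::chrono point, the sketch below uses only standard library facilities; the `chrono_sketch` namespace and the two helper functions are names made up for this example. It shows that a lossless conversion is implicit, while a lossy one or a bare integer is rejected unless spelled out.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>

namespace chrono_sketch {  // hypothetical namespace for this example

using std::chrono::milliseconds;
using std::chrono::seconds;

// Lossless: seconds converts to milliseconds implicitly.
static_assert(std::is_convertible<seconds, milliseconds>::value, "");
// Lossy: milliseconds does not convert to seconds implicitly.
static_assert(!std::is_convertible<milliseconds, seconds>::value, "");
// A bare integer does not silently become a duration.
static_assert(!std::is_convertible<int, seconds>::value, "");

// Implicit widening conversion.
constexpr milliseconds to_ms(seconds s) { return s; }

// Truncation has to be requested explicitly.
constexpr seconds truncate_to_s(milliseconds ms) {
    return std::chrono::duration_cast<seconds>(ms);
}

inline void demo() {
    assert(to_ms(seconds(2)).count() == 2000);
    assert(truncate_to_s(milliseconds(2500)).count() == 2);
}

}  // namespace chrono_sketch
```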

Lastly, charN_t is a really, really horrible name for a type. We should
remove it from the standard, maybe by 2040 or so.

Oh right, Niall Douglas asked about other languages. If you are going to
have a "char" type at all, do it right. Do it like Swift, where Character
is an extended grapheme cluster, because that is the most meaningful
definition for something as ambiguous as "character".

So once we remove "char" from the language, users would be able to write:

using char = std::unicode::grapheme_cluster;

I'm fine with that, but in user code, not in the standard library.


Received on 2019-07-18 00:11:21