Re: [SG16-Unicode] code_unit_sequence

From: Steve Downey <sdowney_at_[hidden]>
Date: Wed, 17 Jul 2019 18:39:06 -0400
Neither
char8_t cu = 0xC0;
nor
std::unicode::utf8_code_unit cu{0xC0};
is code that I'm likely to write, except possibly as a test case. In live
code, data is dynamic, and a code unit, particularly a UTF-8 code unit,
doesn't show up in isolation. Code units show up in sequences, but I fail
to see why I'd want a sequence of code_units, since I'm immediately going
to have to interpret them into something useful. What are the operations
on a utf8_code_unit? What interfaces does it show up in as a vocabulary
type? What is the overhead on it when used in bulk?

Validating each code_unit on its own isn't enough to guarantee even
well-formed UTF-8, so a significant part of the error handling is still
going to be present in processing.
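
To make that concrete, here is a rough sketch, not taken from any proposal.
Both function names are hypothetical, and unsigned char stands in for
char8_t so that it builds before C++20. The first function is the kind of
check a per-code-unit strong type could enforce; the second is the
sequence-level check that processing still needs.

```cpp
#include <cstddef>

// Hypothetical per-code-unit check: rejects only the byte values that can
// never appear anywhere in well-formed UTF-8 (0xC0, 0xC1, 0xF5..0xFF).
constexpr bool is_valid_utf8_code_unit(unsigned char b) {
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}

// Hypothetical sequence check following the well-formed byte sequence
// table in the Unicode Standard: lead byte ranges, trailing byte counts,
// and the narrowed second-byte ranges after E0, ED, F0, and F4.
inline bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        std::size_t len = 1;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of 2nd byte
        if (lead <= 0x7F) {
            len = 1;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) lo = 0xA0;     // excludes overlong forms
            if (lead == 0xED) hi = 0x9F;     // excludes surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) lo = 0x90;     // excludes overlong forms
            if (lead == 0xF4) hi = 0x8F;     // excludes > U+10FFFF
        } else {
            return false;  // lone continuation byte or invalid lead byte
        }
        if (n - i < len) return false;       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = s[i + k];
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (c < min || c > max) return false;
        }
        i += len;
    }
    return true;
}
```

A lone 0x80, or E2 82 with the final AC cut off, passes the first check
for every byte and still fails the second one.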

On Wed, Jul 17, 2019 at 6:11 PM Lyberta <lyberta_at_[hidden]> wrote:

> Steve Downey:
> > What interfaces is utf8_code_unit likely to appear in? I'm not sure I see
> > the value in a strong type here, whereas I can see it for code_point and
> > scalar_value. I expect most conversion operations to translate from
> > untyped raw data, most likely byte, char, or char8_t, directly to
> > code_point or scalar_value. There are some special cases for UTF-8 /
> > UTF-16 conversions, but those are still likely to be on parts of raw
> > buffers or in the vicinity of OS interfaces. At least that's been my
> > experience.
>
> The strong type is used to enforce stronger invariants. With dumb types
> you can shoot yourself in the foot easily:
>
> char8_t cu = 0xC0; // Invalid UTF-8 code unit, yet compiles
>
> char16_t cu1 = 300;
> char8_t cu2 = cu1; // Makes no sense, yet compiles
>
> With my proposal:
>
> std::unicode::utf8_code_unit cu{0xC0}; // Compile time error
>
> std::unicode::utf16_code_unit cu1{300};
> std::unicode::utf8_code_unit cu2 = cu1; // Compile time error
>
> Modern C++ is all about strong types. std::chrono doesn't use dumb types
> because that would be a disaster.
>
> Lastly, charN_t is a really, really horrible name for a type. We should
> remove it from the standard, maybe by 2040 or so.
>
> Oh right, Niall Douglas asked about other languages. If you're going to
> have a "char" type at all, do it right. Do it like Swift, where "char" is
> an extended grapheme cluster, because that is the most meaningful
> definition for something as ambiguous as "character".
>
> So when we remove "char" from the language, the users would be able to
> write
>
> using char = std::unicode::grapheme_cluster;
>
> I'm fine with that. But in the user code, not in the standard library.
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-07-18 00:39:20