Re: [SG16-Unicode] code_unit_sequence

From: Lyberta <lyberta_at_[hidden]>
Date: Wed, 17 Jul 2019 23:14:00 +0000
Steve Downey:
> In live code, data is dynamic, and a code_unit, particularly a utf-8
> code unit, doesn't show up in isolation, they show up in sequences, but I
> fail to see why I'd want a sequence of code_units, as I'm immediately going
> to have to interpret them into something useful.

Yes. But a code unit level is still needed, and using std::basic_string
for it seems like a bad idea because it drags in std::char_traits, a
bloated API, a NUL terminator and dumb types. All of that made some
sense in the 1990s but doesn't make much sense now.

Again, there will be "scalar_value_sequence",
"grapheme_cluster_sequence" and "text" on top. code_unit_sequence is a
low-level thing, but one we need in low-level code.

> What are the operations
> on a utf8_code_unit? What interfaces does it show up in as a vocabulary
> type?

utf8_code_unit has the following member functions:

constexpr value_type value() const noexcept;
constexpr bool is_ascii() const noexcept;
constexpr bool is_leading_byte() const noexcept;
constexpr bool is_continuation_byte() const noexcept;

Those are exposed for encoding forms and for people who want to learn
more about Unicode. You can read the full text of the proposal here:


> What is the overhead on it when used in bulk?

As the type is trivially copyable and relocatable, there shouldn't be
any overhead in release builds.
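
To make the member functions listed above and the overhead claim a bit more concrete, here is a minimal sketch of what such a type could look like. This is not the definition from the proposal: the choice of std::uint8_t as value_type, the explicit constructor, and the exact byte ranges used by is_leading_byte() are assumptions made for illustration only.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch, not the proposal's actual definition.
class utf8_code_unit {
public:
    // Assumption: the proposal may use a different underlying type.
    using value_type = std::uint8_t;

    constexpr utf8_code_unit() noexcept = default;
    constexpr explicit utf8_code_unit(value_type v) noexcept : value_{v} {}

    constexpr value_type value() const noexcept { return value_; }

    // 0xxxxxxx: a code unit that encodes an ASCII character by itself.
    constexpr bool is_ascii() const noexcept { return value_ < 0x80; }

    // Lead of a multi-byte sequence. Assumption: restricted to 0xC2..0xF4,
    // the only lead values that occur in well-formed UTF-8.
    constexpr bool is_leading_byte() const noexcept {
        return value_ >= 0xC2 && value_ <= 0xF4;
    }

    // 10xxxxxx: a trailing code unit of a multi-byte sequence.
    constexpr bool is_continuation_byte() const noexcept {
        return (value_ & 0xC0) == 0x80;
    }

private:
    value_type value_ = 0;
};

// The wrapper adds no storage and can be copied with memcpy,
// so bulk operations should compile down to plain byte copies.
static_assert(std::is_trivially_copyable<utf8_code_unit>::value,
              "utf8_code_unit should be trivially copyable");
static_assert(sizeof(utf8_code_unit) == sizeof(utf8_code_unit::value_type),
              "utf8_code_unit should be no larger than its value_type");
```

With a layout like this, a contiguous sequence of utf8_code_unit occupies the same storage as a sequence of raw bytes, which is why I would not expect a cost in optimized builds.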

> Single code_unit validity isn't enough to get even well formed utf-8, so a
> significant part of error handling is still going to be present in
> processing.

Yes, but it makes conversion to scalar values much easier because the
single-code-unit check already rejects 0xC0 and 0xC1, which prohibits
overlong two-byte sequences.
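
As a rough illustration of how a per-code-unit check interacts with overlong sequences, here is a small sketch. The function names is_valid_utf8_code_unit and decode_two_byte are hypothetical and not taken from the proposal. The byte values 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8, so rejecting them up front means a two-byte decoder cannot produce a value below U+0080. Overlong three- and four-byte forms (0xE0 followed by 0x80..0x9F, 0xF0 followed by 0x80..0x8F) are built only from individually valid bytes, so those would still need a check at the sequence level.

```cpp
#include <cstdint>

// Hypothetical helper: true if the byte can appear in well-formed UTF-8.
// 0xC0 and 0xC1 could only start overlong two-byte sequences;
// 0xF5..0xFF would encode values above U+10FFFF or are unused.
constexpr bool is_valid_utf8_code_unit(std::uint8_t b) noexcept {
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}

// Hypothetical helper: decodes a two-byte sequence. Assumes the caller
// already checked that lead is 110xxxxx and trail is 10xxxxxx.
constexpr std::uint32_t decode_two_byte(std::uint8_t lead,
                                        std::uint8_t trail) noexcept {
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (trail & 0x3F);
}

// The smallest valid two-byte lead, 0xC2, already decodes to U+0080,
// so no overlong check is needed after the per-unit check.
static_assert(decode_two_byte(0xC2, 0x80) == 0x80, "smallest two-byte value");

// The overlong encoding of '/' (0xC0 0xAF) is rejected at its first byte.
static_assert(!is_valid_utf8_code_unit(0xC0), "0xC0 is never valid");

// The overlong three-byte form 0xE0 0x80 0x80 passes the per-unit check,
// so it has to be caught when the sequence is interpreted.
static_assert(is_valid_utf8_code_unit(0xE0) && is_valid_utf8_code_unit(0x80),
              "each byte is valid on its own");
```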

Received on 2019-07-18 01:14:17