sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 19 Jun 2018 15:09:00 +0000

Mark Zeren:
> [mjz] This is one approach. Another is Zach's opinionated "there is only one storage container" approach.

Zach's approach is exactly what I don't want to see in the standard. His
type only supports UTF-8.

As we see with std::chrono, the encoding form should be a template
parameter. Nothing prevents us from standardizing
std::dynamic_encoding_form, where the code unit type is fixed at compile
time while its meaning is determined at runtime.
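
To make that concrete, here is a rough sketch of what I mean, not a
proposed API. Every name in it (utf8_form, utf16_form, dynamic_form,
encoding_id, and this toy code_unit_sequence) is made up for
illustration, and std::vector stands in for whatever storage we end up
with.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical static encoding forms: the code unit type and its
// meaning are both known at compile time.
struct utf8_form  { using code_unit = char; };
struct utf16_form { using code_unit = char16_t; };

// Hypothetical dynamic encoding form: the code unit type is fixed at
// compile time, but which encoding the units represent is a run-time value.
enum class encoding_id { utf8, latin1 };
struct dynamic_form {
    using code_unit = unsigned char;
    encoding_id id;
};

// The sequence takes the encoding form as a template parameter, in the
// same spirit as std::chrono::duration taking Rep and Period.
template <class EncodingForm>
class code_unit_sequence {
public:
    using code_unit = typename EncodingForm::code_unit;

    explicit code_unit_sequence(EncodingForm form = {}) : form_(form) {}

    void push_back(code_unit u) { units_.push_back(u); }
    std::size_t size() const { return units_.size(); }
    const EncodingForm& form() const { return form_; }

private:
    EncodingForm form_;
    std::vector<code_unit> units_;
};

// Small usage example: one static form, one dynamic form.
inline void encoding_form_example() {
    code_unit_sequence<utf16_form> fixed;
    fixed.push_back(u'a');
    assert(fixed.size() == 1);

    code_unit_sequence<dynamic_form> dynamic(dynamic_form{encoding_id::latin1});
    assert(dynamic.form().id == encoding_id::latin1);
}
```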

The only advantage of std::basic_string over std::vector is Small Buffer
Optimization. Perhaps we can work with LEWG to standardize something
like sbo_vector. Then code_unit_sequence could just take it as a template
parameter but require that value_type be std::byte.

The hierarchy would then be, from bottom to top:

* std::sbo_vector<std::byte>
* std::code_unit_sequence
* std::code_point_sequence
* std::text

Each template would use the previous one in its implementation. Of
course, this is just the default hierarchy. A user can manually opt in to:

* std::vector<char16_t> // For UTF-16 case, for example.
* std::code_point_sequence
* std::text

Or:

* std::vector<char32_t>
* std::text

Or even:

* std::vector<char32_t>
* std::code_point_sequence // Basically no-ops on this layer. This case
  is typical for TMP.
* std::text
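
A minimal sketch of how that layering could look, covering only the
UTF-32 case where the code point layer is a no-op. The class templates
here are illustrative stand-ins that I made up for this mail, not
proposed interfaces. A real code_point_sequence over UTF-8 or UTF-16
storage would have to decode instead of forwarding.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical code point layer. Over UTF-32 storage every code unit is
// already a code point, so it only forwards to the storage below.
template <class Storage>
class code_point_sequence {
public:
    explicit code_point_sequence(Storage storage) : storage_(std::move(storage)) {}

    std::size_t size() const { return storage_.size(); }
    char32_t operator[](std::size_t i) const { return storage_[i]; }

private:
    Storage storage_;
};

// Hypothetical text layer. It only needs size() and operator[] yielding
// code points, so it accepts either a code_point_sequence or a plain
// std::vector<char32_t>.
template <class CodePoints>
class text {
public:
    explicit text(CodePoints code_points) : code_points_(std::move(code_points)) {}

    std::size_t code_point_count() const { return code_points_.size(); }
    char32_t code_point_at(std::size_t i) const { return code_points_[i]; }

private:
    CodePoints code_points_;
};

// Usage: the same text template over two different stacks.
inline void layering_example() {
    std::vector<char32_t> units{U'h', U'i'};

    // std::vector<char32_t> -> std::text
    text<std::vector<char32_t>> direct(units);

    // std::vector<char32_t> -> std::code_point_sequence -> std::text
    using cps = code_point_sequence<std::vector<char32_t>>;
    text<cps> layered{cps(units)};

    assert(direct.code_point_count() == layered.code_point_count());
    assert(layered.code_point_at(1) == U'i');
}
```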

I'm a bit baffled by Zach's design. He goes 100% templates above the
code point level, so there was no need to restrict his "string layer" to
UTF-8, especially since implementing code point iteration is much easier
than grapheme cluster iteration and the higher layers, which he did
implement.
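
To show how little code point iteration takes, here is a minimal UTF-8
decoder. It is a sketch that assumes well-formed input: it does not
reject truncated, overlong, or surrogate sequences, which a real
implementation would have to handle. The function name decode_utf8 is
my own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 bytes into code points. Assumes the input is well-formed.
inline std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        char32_t cp;
        // The lead byte tells us the sequence length and carries the
        // most significant payload bits.
        if (lead < 0x80)      { len = 1; cp = lead; }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else                  { len = 4; cp = lead & 0x07; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage: one code point each of 1, 2, 3 and 4 bytes.
inline void decode_example() {
    const std::vector<char32_t> cps =
        decode_utf8("a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80");
    assert(cps.size() == 4);
    assert(cps[1] == 0x00E9);  // U+00E9
}
```

The UTF-16 equivalent only needs surrogate pair handling, so supporting
more than one encoding at this layer is not a large amount of work
compared to grapheme cluster segmentation.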

Received on 2018-06-19 17:10:12