Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 19 Jun 2018 14:34:18 -0400
On 06/19/2018 11:09 AM, Lyberta wrote:
> Mark Zeren:
>> [mjz] This is one approach. Another is Zach's opinionated "there is only one storage container" approach.
> Zach's approach is exactly what I don't want to see in the standard. His
> type only supports UTF-8.

I also don't want to see a UTF-8-only text type; GB18030 is important in
China and UTF-16 isn't going away any time soon. And uses for Modified
UTF-8, CESU-8, Shift-JIS, etc. will remain long into the future.

> As we see with std::chrono, the encoding form should be a template
> parameter. Nothing restricts us from standardizing
> std::dynamic_encoding_form, where the code unit type is fixed at compile
> time while its meaning is determined at runtime.
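
A minimal sketch of what the quoted suggestion might look like, assuming
C++17. Only std::dynamic_encoding_form is named in the message; its
interface here is a guess, and utf8_encoding_form, utf16_encoding_form,
encoded_text, and every member shown are hypothetical names invented for
illustration, not a proposed design.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical static encoding forms: both the code unit type and its
// interpretation are fixed at compile time.
struct utf8_encoding_form {
    using code_unit_type = char;
    static constexpr std::string_view name() { return "UTF-8"; }
};

struct utf16_encoding_form {
    using code_unit_type = char16_t;
    static constexpr std::string_view name() { return "UTF-16"; }
};

// Hypothetical dynamic encoding form: the code unit type is a template
// parameter, but which encoding those units represent is a run-time value.
template <class CodeUnit>
class dynamic_encoding_form {
public:
    using code_unit_type = CodeUnit;
    explicit dynamic_encoding_form(std::string_view name) : name_(name) {}
    std::string_view name() const { return name_; }

private:
    std::string_view name_;  // assumed to refer to storage that outlives this object
};

// Hypothetical text type parameterized on the encoding form. It only
// stores code units and reports what they are said to mean; it performs
// no validation or transcoding.
template <class EncodingForm>
class encoded_text {
public:
    using code_unit_type = typename EncodingForm::code_unit_type;

    explicit encoded_text(std::basic_string<code_unit_type> units,
                          EncodingForm form = EncodingForm{})
        : units_(std::move(units)), form_(std::move(form)) {}

    std::string_view encoding_name() const { return form_.name(); }
    std::size_t code_unit_count() const { return units_.size(); }

private:
    std::basic_string<code_unit_type> units_;
    EncodingForm form_;
};
```

With these definitions, encoded_text<utf8_encoding_form> and
encoded_text<dynamic_encoding_form<char>> share the same char storage
but differ in when the encoding is known.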


> The only advantage of std::basic_string over std::vector is the Small
> Buffer Optimization. Perhaps we can work with LEWG to standardize
> something like sbo_vector.

There have been attempts. See P0274:
- http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0274r0.pdf

> Then code_unit_sequence could just take it as a template
> parameter but require that value_type be std::byte.
> The hierarchy would then be from bottom to top:
> * std::sbo_vector<std::byte>
> * std::code_unit_sequence
> * std::code_point_sequence
> * std::text

This is overspecification in my opinion. And like Martinho, I don't see
the point of code_unit_sequence (or code_point_sequence); that is a
concept, not a container.
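
To illustrate the distinction, here is a minimal sketch of "code unit
sequence" expressed as a requirement on types rather than as a container.
It assumes C++17 and uses a type trait; with C++20 the same requirement
could be spelled with the `concept` keyword. The names is_code_unit_v,
is_code_unit_sequence, and count_code_units are hypothetical, and the
choice of which types count as code units is an assumption.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical trait: the character types treated as code units here.
template <class T>
inline constexpr bool is_code_unit_v =
    std::is_same_v<T, char> || std::is_same_v<T, wchar_t> ||
    std::is_same_v<T, char16_t> || std::is_same_v<T, char32_t>;

// Primary template: a type that cannot be iterated is not a sequence.
template <class T, class = void>
struct is_code_unit_sequence : std::false_type {};

// Any iterable type whose elements are code units satisfies the
// requirement, regardless of which container it happens to be.
template <class T>
struct is_code_unit_sequence<
    T, std::void_t<decltype(std::begin(std::declval<const T&>())),
                   decltype(std::end(std::declval<const T&>()))>>
    : std::bool_constant<is_code_unit_v<std::remove_cv_t<std::remove_reference_t<
          decltype(*std::begin(std::declval<const T&>()))>>>> {};

template <class T>
inline constexpr bool is_code_unit_sequence_v = is_code_unit_sequence<T>::value;

// A generic algorithm constrained on the requirement instead of on a
// specific container type.
template <class Seq, class = std::enable_if_t<is_code_unit_sequence_v<Seq>>>
std::size_t count_code_units(const Seq& s) {
    return static_cast<std::size_t>(std::distance(std::begin(s), std::end(s)));
}
```

Under this sketch std::string, std::u16string, and std::vector<char32_t>
all qualify without any dedicated code_unit_sequence class.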

> Where each template will use the previous one in its implementation. Of
> course, this is just the default hierarchy. A user can manually opt in for:
> * std::vector<char16_t> // For UTF-16 case, for example.
> * std::code_point_sequence
> * std::text
> Or:
> * std::vector<char32_t>
> * std::text
> Or even:
> * std::vector<char32_t>
> * std::code_point_sequence // Basically no-ops on this layer. This case
> is typical for TMP.
> * std::text
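
The layering quoted above can be sketched as follows, assuming C++17.
The class below is a hypothetical stand-in for the quoted
std::code_point_sequence, written in the global namespace; its interface
is invented for illustration. It assumes well-formed input, performs no
validation, and handles only UTF-16 and UTF-32 storage.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical adapter layered on a user-chosen container of code units.
template <class Container>
class code_point_sequence {
    using unit = typename Container::value_type;
    static_assert(std::is_same_v<unit, char16_t> || std::is_same_v<unit, char32_t>,
                  "this sketch handles only UTF-16 and UTF-32 storage");

public:
    explicit code_point_sequence(Container units) : units_(std::move(units)) {}

    // Decode the stored code units into code points.
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        if constexpr (std::is_same_v<unit, char32_t>) {
            // UTF-32: each code unit already is a code point, so this
            // layer does no decoding work.
            out.assign(units_.begin(), units_.end());
        } else {
            // UTF-16: combine each high/low surrogate pair into one code point.
            for (std::size_t i = 0; i < units_.size(); ++i) {
                char32_t u = units_[i];
                if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units_.size()) {
                    char32_t low = units_[++i];
                    u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
                }
                out.push_back(u);
            }
        }
        return out;
    }

    // Access to the underlying storage layer.
    const Container& storage() const { return units_; }

private:
    Container units_;
};
```

For example, code_point_sequence<std::vector<char16_t>> holding the code
units 0x0041, 0xD83D, 0xDE00 yields the two code points U+0041 and
U+1F600, while code_point_sequence<std::vector<char32_t>> passes its
elements through unchanged.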

Why do you think it is important to specify an underlying storage
container type for std::text?

> I'm a bit baffled by Zach's design. He goes 100% templates above the
> code point level, so there was no need to restrict his "string layer" to
> UTF-8, especially since implementing code point iteration is much easier
> than grapheme cluster iteration and the higher levels, which he did
> implement.

I'll let Zach speak for himself if he wishes to.


Received on 2018-06-19 20:41:48