sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Lyberta <lyberta_at_[hidden]>
Date: Mon, 18 Jun 2018 18:41:00 +0000

Zach Laine:
> This is certainly the right venue. Do you have an interface in mind?
> Posting a synopsis could start things moving.
>
> Zach

code_unit_sequence works on layer 0 - bytes - and provides iterators to
layer 1 - code units. The intended use case is working with UTF-16 and
UTF-32 where the endianness of the stored units differs from the machine
endianness and byte-swapping everything in advance is too slow. The
implementation would use template metaprogramming to support all
encodings and endiannesses.
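
To illustrate the idea of decoding on access rather than byte-swapping
up front, here is a minimal sketch. It is not the proposed interface:
byte_order and load_utf16_unit are hypothetical names, and byte_order is
only a stand-in for std::endian so that the sketch compiles as C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for std::endian (an assumption of this sketch).
enum class byte_order { little, big };

// Returns the UTF-16 code unit at code-unit position `index` of `bytes`.
// The two bytes are combined on access, so the buffer itself is never
// byte-swapped in bulk.
template <byte_order Order>
char16_t load_utf16_unit(const std::vector<std::byte>& bytes,
                         std::size_t index)
{
    assert(2 * index + 1 < bytes.size());
    const auto first = std::to_integer<std::uint16_t>(bytes[2 * index]);
    const auto second = std::to_integer<std::uint16_t>(bytes[2 * index + 1]);
    if constexpr (Order == byte_order::big) {
        // Big endian: the most significant byte is stored first.
        return static_cast<char16_t>((first << 8) | second);
    } else {
        // Little endian: the least significant byte is stored first.
        return static_cast<char16_t>((second << 8) | first);
    }
}
```

A proxy returned by a code_unit_sequence iterator could do this kind of
load in its conversion operator and the matching store in its assignment
operator.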

Synopsis would be something like:

template <TextEncoding TE, std::endian Endianness = std::endian::native,
typename Allocator = std::allocator<std::byte>>
class code_unit_sequence;

It will have an interface similar to your boost::string from Boost.Text
and have random access iterators that return a proxy type convertible
to char8_t for UTF-8, char16_t for UTF-16 and char32_t for UTF-32. Maybe
there will be another template parameter for invalid code unit handling.

Also there will be a concept named CodeUnitSequence that requires an
interface similar to that of std::code_unit_sequence. I think both
std::vector<[w]char[8,16,32]_t> and std::basic_string should satisfy
that concept.
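
As a rough illustration of what such a check could look like, here is a
C++17 type trait that approximates the concept. The names
is_code_unit_v and is_code_unit_sequence and the chosen requirements
(value_type, begin, end, size) are my assumptions for this sketch, the
real concept would require much more, and char8_t is left out because it
needs C++20.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper: true for the character types treated as code units.
template <typename T>
inline constexpr bool is_code_unit_v =
    std::is_same_v<T, char> || std::is_same_v<T, wchar_t> ||
    std::is_same_v<T, char16_t> || std::is_same_v<T, char32_t>;

// Primary template: types without the required members are rejected.
template <typename T, typename = void>
struct is_code_unit_sequence : std::false_type {};

// Chosen when T has value_type, begin(), end() and size(); the result
// then depends on whether value_type is a code unit type.
template <typename T>
struct is_code_unit_sequence<
    T, std::void_t<typename T::value_type,
                   decltype(std::declval<const T&>().begin()),
                   decltype(std::declval<const T&>().end()),
                   decltype(std::declval<const T&>().size())>>
    : std::bool_constant<is_code_unit_v<typename T::value_type>> {};

static_assert(is_code_unit_sequence<std::u16string>::value);
static_assert(is_code_unit_sequence<std::vector<char32_t>>::value);
static_assert(!is_code_unit_sequence<std::vector<int>>::value);
```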

std::code_point_sequence works on layer 1 - code units - and provides
iterators to layer 2 - code points. It will take a type that satisfies
CodeUnitSequence and use it for memory management. It will provide
bidirectional iterators that return a proxy type convertible to
char32_t. The iterators will be complex because a single code point can
consist of a different number of code units, so assignment may lead to
reallocation of the underlying buffer and invalidation of some
iterators. I guess that will break some std algorithms but that's the
reality we will have to deal with.
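
To show why assignment through such an iterator can resize the buffer,
here is a minimal UTF-16 sketch on top of std::u16string. It assumes
well-formed input, and utf16_length, next_code_point and
assign_code_point are hypothetical helper names, not part of any
proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-16 code units needed to encode `cp`.
constexpr std::size_t utf16_length(char32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}

// Decodes the code point starting at units[pos] and advances pos past it.
// Assumes well-formed UTF-16 (every lead surrogate has a trail surrogate).
char32_t next_code_point(const std::u16string& units, std::size_t& pos)
{
    assert(pos < units.size());
    const char32_t lead = units[pos++];
    if (lead >= 0xD800 && lead <= 0xDBFF) {
        assert(pos < units.size());
        const char32_t trail = units[pos++];
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    }
    return lead;
}

// Replaces the code point starting at units[pos] with `cp`. The string
// grows or shrinks when the old and new encodings differ in length, which
// is what can reallocate the buffer and invalidate iterators.
void assign_code_point(std::u16string& units, std::size_t pos, char32_t cp)
{
    std::size_t end = pos;
    next_code_point(units, end); // find the extent of the old code point
    char16_t encoded[2];
    const std::size_t count = utf16_length(cp);
    if (count == 1) {
        encoded[0] = static_cast<char16_t>(cp);
    } else {
        const char32_t offset = cp - 0x10000;
        encoded[0] = static_cast<char16_t>(0xD800 + (offset >> 10));
        encoded[1] = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    }
    units.replace(pos, end - pos, encoded, count);
}
```

Replacing U+1F600 (two code units) with U+007A (one code unit) shrinks
the string by one unit, so every position after it shifts.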

Synopsis would be something like:
template <CodeUnitSequence Container, TextEncoding ET =
std::default_encoding_type_t<Container>>
class code_point_sequence;

Of course, there will be corresponding view types.

I have implemented my own version of code_point_sequence and
code_point_sequence_view here:
https://gitlab.com/ftz/unicode

Of course, then we have std::text that would take a CodePointSequence
and provide grapheme cluster iterators. I did not have enough free time
to implement grapheme cluster iteration so I'll leave it to other people.

So I see at least 5 papers:
* Fundamental encoding concepts, types and helpers such as TextEncoding,
std::utf8, std::default_encoding_type_t, etc
* std::code_unit_sequence
* std::code_unit_sequence_view
* std::code_point_sequence
* std::code_point_sequence_view

It would be fair to standardize them in this order. Views may be
standardized before the corresponding containers, but we should see
implementations of the containers before deciding on the interface of
the views.

Received on 2018-06-18 20:50:57