Date: Mon, 18 Jun 2018 22:03:45 +0200
I don't think this code_unit_sequence is useful. The focus on endianness is misguided, IMO.
There's no reason to convert encoding schemes to encoding forms (transcoding notwithstanding). The encoding schemes from the Unicode standard that we need to support are UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32LE, UTF-32BE (and possibly schemes with BOMs, but that needs a different design because a BOM effectively makes the encoding stateful, so let's leave it out for now).
There's no byte level. The lowest level that is useful is code units. It just happens that some encodings (e.g. UTF-8, UTF-16LE) have bytes as code units. Everything is code units.
This code_unit_sequence *might* be useful as an implementation detail, but not so much as a user interface. All it does is abstract away endianness when that is already abstracted by the encoding schemes themselves, like UTF-16LE.
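To illustrate, here is a rough sketch of an encoding whose code units are bytes and which decodes straight to code points. The names utf16le, decode_one and decode_all are made up for this example and come from no existing library or proposal, and the input is assumed to be well-formed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding type: UTF-16LE, with bytes as its code units.
struct utf16le {
    using code_unit = std::uint8_t;

    // Decodes one code point starting at `it` and advances `it` past it.
    // Assumes well-formed input; only truncation is checked, via assert.
    static char32_t decode_one(const code_unit*& it, const code_unit* last) {
        assert(last - it >= 2);
        // The byte order is fixed by the encoding scheme, so the result
        // does not depend on the endianness of the machine.
        const char32_t first = static_cast<char32_t>(it[0] | (it[1] << 8));
        it += 2;
        if (first < 0xD800 || first > 0xDBFF) {
            return first;  // not a high surrogate: a single 16-bit value
        }
        assert(last - it >= 2);
        const char32_t second = static_cast<char32_t>(it[0] | (it[1] << 8));
        it += 2;
        // Combine the surrogate pair into a supplementary code point.
        return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }
};

// Decodes a whole sequence of code units into code points.
template <typename Encoding>
std::vector<char32_t> decode_all(
    const std::vector<typename Encoding::code_unit>& units) {
    std::vector<char32_t> out;
    const auto* it = units.data();
    const auto* last = it + units.size();
    while (it != last) {
        out.push_back(Encoding::decode_one(it, last));
    }
    return out;
}
```

A hypothetical utf16be would differ only in the two lines that assemble the 16-bit values, so no separate endianness parameter or byte layer is needed in the interface.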
On June 18, 2018 8:41:00 PM GMT+02:00, Lyberta <lyberta_at_[hidden]> wrote:
>Zach Laine:
>> This is certainly the right venue. Do you have an interface in mind?
>> Posting a synopsis could start things moving.
>>
>> Zach
>
>code_unit_sequence works on layer 0 - bytes - and provides iterators to
>layer 1 - code units. The intended use case is working with UTF-16 and
>UTF-32 where endianness of stored units is not equal to machine
>endianness and byte-swapping everything in advance is too slow. You will
>use template metaprogramming to support all encodings and endiannesses.
>
>Synopsis would be something like:
>
>template <TextEncoding TE, std::endian Endianness = std::endian::native,
>          typename Allocator = std::allocator<std::byte>>
>class code_unit_sequence;
>
>It will have an interface similar to your boost::string from Boost.Text
>and have random access iterators that would return a proxy type
>convertible to char8_t for UTF-8, char16_t for UTF-16 and char32_t for
>UTF-32. Maybe another template parameter for invalid code unit handling.
>
>Also there will be a concept named CodeUnitSequence that requires an
>interface similar to that of std::code_unit_sequence. I think both
>std::vector<[w]char[8,16,32]_t> and std::basic_string should satisfy
>that concept.
>
>std::code_point_sequence works on layer 1 - code units - and provides
>iterators to layer 2 - code points. It will take a type that satisfies
>CodeUnitSequence and use it for memory management. It will provide
>bidirectional iterators that return a proxy type convertible to char32_t.
>The iterators will be complex because a single code point can consist of
>a different number of code units, so assignment may lead to reallocation
>of the underlying buffer and invalidation of some iterators. I guess that
>will break some std algorithms but that's the reality we will have to
>deal with.
>
>Synopsis would be something like:
>template <CodeUnitSequence Container,
>          TextEncoding ET = std::default_encoding_type_t<Container>>
>class code_point_sequence;
>
>Of course, there will be corresponding view types.
>
>I have implemented my own version of code_point_sequence and
>code_point_sequence_view here:
>https://gitlab.com/ftz/unicode
>
>Of course, then we have std::text that would take CodePointSequence and
>provide grapheme cluster iterators. My free time was not enough to
>implement grapheme cluster iteration so I'll leave it to other people.
>
>So I see at least 5 papers:
>* Fundamental encoding concepts, types and helpers such as TextEncoding,
>  std::utf8, std::default_encoding_type_t, etc.
>* std::code_unit_sequence
>* std::code_unit_sequence_view
>* std::code_point_sequence
>* std::code_point_sequence_view
>
>It would be fair to standardize them in this order. Views may be
>standardized before the corresponding containers, but we should see
>implementations of the containers before deciding on the interface of
>the views.
Received on 2018-06-18 22:11:23