Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 19 Jun 2018 09:53:00 +0000
R. Martinho Fernandes:
> I don't quite understand why deserializing into 16-bit units is useful, though. I would expect code that deserializes text to either perform transcoding to produce a buffer in an encoding suitable to work with some external API, or otherwise to need the decoded text, not the code units. I might be missing something people do with code units, but IME they're either decoded or opaque blobs to pass elsewhere.

With code_unit_sequence, [de]serialization is conceptually equivalent to
std::memcpy. With std::basic_string it is more complicated because of
potential byte swapping.
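
To make the contrast concrete, here is a minimal sketch of one way to
read it. The helper names load_code_units and load_utf16be are
hypothetical and are not part of any proposal; the sketch also assumes
char16_t is exactly two bytes wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

static_assert(sizeof(char16_t) == 2, "sketch assumes 16-bit char16_t");

// Hypothetical helper: copy serialized bytes into 16-bit code units as is.
// This is the "conceptually std::memcpy" case; byte order is not interpreted.
inline std::vector<char16_t> load_code_units(const std::vector<unsigned char>& bytes)
{
    assert(bytes.size() % 2 == 0);
    std::vector<char16_t> units(bytes.size() / 2);
    if (!bytes.empty())
        std::memcpy(units.data(), bytes.data(), bytes.size());
    return units;
}

// Hypothetical helper: read UTF-16BE bytes into a std::u16string.
// Each code unit is assembled from two bytes, which amounts to a byte swap
// on little-endian hosts and works unchanged on big-endian hosts.
inline std::u16string load_utf16be(const std::vector<unsigned char>& bytes)
{
    assert(bytes.size() % 2 == 0);
    std::u16string text(bytes.size() / 2, u'\0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        text[i] = static_cast<char16_t>((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return text;
}
```

The first function never looks at the values it copies, while the
second has to know the serialized byte order to produce usable code
units.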

> More importantly, though, I don't understand what needs to get complicated in the code point interface. The interface you have is already enough as is (any reservations that some might have about adding new sequence container types notwithstanding). The interface requires no change at all to support UTF-16BE, etc; the implementation can use std::string just fine (remember, the code units for UTF-16BE are just bytes). It will probably work fine when you finish the implementation; it just needs implementations of the encoding schemes.

The proposed text_view takes a TextEncoding, and there are
std::utf16[be,le]_encodings that satisfy TextEncoding. This breaks the
abstraction and makes user code more complicated. text_view and
code_point_sequence shouldn't take encoding schemes as template
parameters, only encoding forms. Essentially, TextEncoding is as
horrible as std::basic_string in its design.

I guess I should update my proposal:

template <CodeUnitSequence Container,
          EncodingForm EF = std::default_encoding_form_t<Container>>
class code_point_sequence;

> The interface requires no change at all to support UTF-16BE.

UTF-16BE is an encoding scheme and would not compile with the revised
interface. Thank you for informing me about the distinction between
encoding forms and encoding schemes.
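
As a rough illustration of how a scheme could be rejected at compile
time, here is a sketch that uses a plain type trait in place of the
EncodingForm concept. Every name in it (utf8_form, utf16_form,
utf16be_scheme, is_encoding_form, default_encoding_form,
code_point_sequence_sketch) is hypothetical, and the default mapping
from std::string to UTF-8 is an assumption made for the example only.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tag types; none of these names come from the proposal.
struct utf8_form {};       // encoding form: sequence of 8-bit code units
struct utf16_form {};      // encoding form: sequence of 16-bit code units
struct utf16be_scheme {};  // encoding scheme: byte serialization of UTF-16

// Trait standing in for the EncodingForm concept: true only for forms.
template <class T> struct is_encoding_form : std::false_type {};
template <> struct is_encoding_form<utf8_form> : std::true_type {};
template <> struct is_encoding_form<utf16_form> : std::true_type {};

// Stand-in for std::default_encoding_form_t<Container>.
// Assumption: the default is picked from the container type alone.
template <class Container> struct default_encoding_form;
template <> struct default_encoding_form<std::string> { using type = utf8_form; };
template <> struct default_encoding_form<std::u16string> { using type = utf16_form; };

template <class Container,
          class EF = typename default_encoding_form<Container>::type>
class code_point_sequence_sketch {
    static_assert(is_encoding_form<EF>::value,
                  "EF must be an encoding form, not an encoding scheme");
    Container units_;  // underlying code units

public:
    bool empty() const { return units_.empty(); }
};
```

With this sketch, code_point_sequence_sketch<std::u16string> compiles
and defaults to utf16_form, while
code_point_sequence_sketch<std::string, utf16be_scheme> fails the
static_assert.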


Received on 2018-06-19 11:53:59