Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 20 Jun 2018 11:30:09 -0400

On 06/20/2018 01:52 AM, Lyberta wrote:
> Tom Honermann:
>> Wanting to work with code points and code units is certainly fine and
>> often necessary. But what need is served by these new containers that
>> isn't already served by existing containers (std::vector, std::string,
>> QString, etc...)
> There is no container adaptor that takes a sequence of code units and
> adapts it as a sequence of code points.

Ok, but what is the use case for such a container that isn't already
served by either `text_view<code-unit-container>` or a `text` type where
underlying storage is an implementation detail? What benefit do we get
from being able to specify the underlying code unit storage container?
From my perspective, such a design causes non-interoperable type
proliferation (as with `std::stack<int, std::vector<int>>` vs
`std::stack<int, std::deque<int>>`). As we've discussed, a facility to
move underlying storage between containers is desirable, but is
achievable without having to specify an underlying container type.
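
To make the type proliferation concern concrete, here is a minimal
sketch using only standard containers. The aliases and the helper name
`sum_and_drain` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <deque>
#include <stack>
#include <type_traits>
#include <vector>

// Two adaptors over the same element type that differ only in storage.
using vector_stack = std::stack<int, std::vector<int>>;
using deque_stack  = std::stack<int, std::deque<int>>;

// The storage choice is part of the type, so these are unrelated types.
static_assert(!std::is_same_v<vector_stack, deque_stack>);
// Neither converts implicitly to the other, even though both hold ints.
static_assert(!std::is_convertible_v<vector_stack, deque_stack>);
static_assert(!std::is_convertible_v<deque_stack, vector_stack>);

// Hypothetical helper: an interface written against one adaptor type
// cannot accept the other unless it is turned into a template.
inline int sum_and_drain(vector_stack& s) {
    int total = 0;
    while (!s.empty()) {
        total += s.top();
        s.pop();
    }
    assert(s.empty());  // postcondition: the stack has been drained
    return total;
}
```

Passing a `deque_stack` to `sum_and_drain` is a compile error; the same
would presumably hold for text types parameterized on their code unit
container.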

>
>> I agree with this goal and we've been discussing ways to accomplish it.
>> I've been intending to write a proposal to make it possible to extract
>> the buffer from std::vector and std::string (and eventually std::text)
>> and move it between such containers. This would work similarly to node
>> extraction for node based containers and would introduce a type similar
>> to node handle [1]. The intent is to enable something like:
>>
>> std::vector<char> v = ...;
>> std::string s(std::move(v.extract()));
>> std::text t(std::move(s.extract()));
>>
>> QString and other custom string types would be able to opt-in to the
>> mechanism.
> This demands something that doesn't exist yet for 3rd party types and
> will limit compatibility with them even further.

That is true, but a `std::extract` customization point could enable
integration for such types.
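
For comparison, this is how the existing C++17 node extraction that the
buffer idea is modelled on behaves. It is not the proposed buffer
extraction, which does not exist yet; the function name `transfer` is
hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Moves the element with the given key from one map to another without
// copying the element (C++17 node extraction).
inline bool transfer(std::map<int, std::string>& from,
                     std::map<int, std::string>& to, int key) {
    auto node = from.extract(key);  // node handle; empty if key is absent
    if (node.empty()) {
        return false;
    }
    auto result = to.insert(std::move(node));
    if (!result.inserted) {
        // Key collision in 'to': the node comes back in result.node,
        // so return it to the source instead of dropping it.
        from.insert(std::move(result.node));
    }
    return result.inserted;
}
```

A buffer handle for std::vector and std::string would presumably play
the role that the node handle plays here.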

>
>> I don't see a convincing argument here and I'm still struggling to
>> understand your perspective. In my opinion, having the endian concerns
>> and byte level access wrapped in an encoding class that has a well known
>> name is quite convenient. Is your concern primarily philosophical?
>> Perhaps based on separation of concerns?
> Yes. Code point layer or layers above it shouldn't care about endianness
> or BOM. Only byte layer.

"should" and "shouldn't" do not make for a convincing argument. I
understand the motivation for design purity, but in practice, pure
designs frequently turn out to be cumbersome to use. I'm more motivated
by use cases than I am by philosophical arguments.

When I first started on text_view, I did experiment with separating
encoding schemes and encoding forms; e.g., by layering encoding forms on
top of endian transformations (only for BE/LE encoding variants), but I
found it only introduced complexity and didn't solve any actual problems.

> I have written a serialization library (already
> ported to std::endian btw) and I will implement endianness and BOM
> handling so I will have code samples ready.
>
>> I can see uses for a codec that deals only with the endian concerns.
>> But why force programmers concerned with working with text to explicitly
>> interact with such a type?
> The idea is that programmers won't need to.
>
> std::text t = u8"Hello";
>
> Type of text will be
> std::text<std::code_point_sequence<std::code_unit_sequence<std::utf8,
> std::endian::native, std::no_bom>>>;
>
> Here the standard library has chosen native endianness and no reading or
> writing of BOM - a sane default. Then we provide helpers such as:
>
> auto t = std::make_text<std::endian::big, std::bom>(u8"Hello");
>
> Type of text will be
> std::text<std::code_point_sequence<std::code_unit_sequence<std::utf8,
> std::endian::big, std::bom>>>;
>
> Here the programmer has explicitly requested BE with reading and writing
> of BOM. std::bom and std::no_bom are just placeholders, this should be
> an enum class.

Others have already noted the fact that endian concerns are not relevant
for UTF-8, so I'll assume the following for UTF-16:

std::text t = u"Hello";

Type of text will be
std::text<std::code_point_sequence<std::code_unit_sequence<std::utf16,
     std::endian::native, std::no_bom>>>;

Let's talk about what this means. Somewhere in 't' there is an array of
elements that make up the underlying storage. Is that storage an array
of char16_t, std::byte, unsigned char, or something else? I'm guessing
one of the latter (e.g., byte based, not char16_t). If so, then this
implies the initialization can't be a straight memcpy() of the u"Hello"
initializer since the char16_t code units have to be transformed to
BE/LE code units. I don't want to pay that cost, particularly because I
see no benefit to actually storing this text in any particular endian
order in memory. I want native char16_t code units for this case.
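
A rough sketch of the two storage strategies, to show where the cost
comes from. The function names `store_native` and `store_utf16be` are
hypothetical, and `unsigned char` stands in for whatever octet type a
byte-based design would use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Native storage: the code units are copied as-is, which the library
// can implement as a single block copy.
inline std::u16string store_native(const char16_t* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    return std::u16string(s, n);
}

// Big-endian octet storage: every code unit has to be split into two
// octets, so initialization is a per-element transformation.
inline std::vector<unsigned char> store_utf16be(const char16_t* s,
                                                std::size_t n) {
    assert(s != nullptr || n == 0);
    std::vector<unsigned char> out;
    out.reserve(n * 2);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<unsigned char>((s[i] >> 8) & 0xFF));
        out.push_back(static_cast<unsigned char>(s[i] & 0xFF));
    }
    return out;
}
```

The second form also has to be decoded again on every code point
access, whereas the first can be read directly as char16_t.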

>
>> Is it your intention that, given u"text" (UTF-16), that it should be
>> possible to obtain a std::byte pointer to the underlying code units
>> (e.g., a sequence of bytes in either BE or LE order)? If so, that isn't
>> possible because some systems have 16-bit (or larger) bytes where
>> sizeof(char16_t)==1; the individual octets of char16_t are not
>> addressable on such systems.
> char16_t and char32_t are POD (I know this concept is deprecated) so
> doing reinterpret_cast<std::byte*>(&char16_t_variable) is legal and is
> necessary for serialization.

Yes, that is legal, but it doesn't do (portably) what I think you want.

I think you are under the impression that you can do this:

char16_t c = ...;
std::byte *p = reinterpret_cast<std::byte*>(&c);
p[0]; // high-or-low octet of 'c' depending on host endian order.
p[1]; // high-or-low octet of 'c' depending on host endian order.

This doesn't work because std::byte and char16_t may both be the same
size; it isn't possible to portably address the octets separately. On
systems where they are the same size, p[1] addresses storage beyond the
end of the storage for 'c'.
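
A portable way to get at the octets is to compute them from the value
rather than to address them in memory. The sketch below assumes nothing
about byte width; the names `octets`, `split` and `join` are
hypothetical.

```cpp
#include <cassert>
#include <climits>

// The two octets of a UTF-16 code unit, each held in its own object.
struct octets {
    unsigned char high;
    unsigned char low;
};

// Computes the octets arithmetically; independent of host endianness
// and of how many bits a byte has.
constexpr octets split(char16_t c) {
    return {static_cast<unsigned char>((c >> 8) & 0xFF),
            static_cast<unsigned char>(c & 0xFF)};
}

inline char16_t join(octets o) {
    // unsigned char may be wider than 8 bits, so check the octet range.
    assert(o.high <= 0xFF && o.low <= 0xFF);
    return static_cast<char16_t>((o.high << 8) | o.low);
}

static_assert(split(char16_t{0x20AC}).high == 0x20);
static_assert(split(char16_t{0x20AC}).low == 0xAC);
// char16_t always has at least 16 bits, but sizeof(char16_t) is 1
// whenever CHAR_BIT is 16 or more, which is the case described above.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16);
```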

Tom.

Received on 2018-06-20 17:30:15