SG16

Subject: Re: [SG16-Unicode] code_unit_sequence and code_point_sequence
From: Tom Honermann (tom_at_[hidden])
Date: 2018-06-19 20:30:37


On 06/19/2018 03:00 PM, Lyberta wrote:
> Tom Honermann:
>> This is overspecification in my opinion.  And like Martinho, I don't see
>> the point of code_unit_sequence (or code_point_sequence); that is a
>> concept, not a container.
> This is the "experts only" feature. Some people want to work with code
> points and code units.

Wanting to work with code points and code units is certainly fine and
often necessary.  But what need is served by these new containers that
isn't already served by existing containers (std::vector, std::string,
QString, etc.)?

>
>> Why do you think it is important to specify an underlying storage
>> container type for std::text?
> I don't want to copy data between QString and std::text. I want
> std::text to consume QString by moving it inside.
>
> QString s{"Hello"};
> auto text = MakeStdText(std::move(s));
>
> Type of text is now std::text<std::code_point_sequence<QString,
> std::utf16>>.

I agree with this goal and we've been discussing ways to accomplish it. 
I've been intending to write a proposal to make it possible to extract
the buffer from std::vector and std::string (and eventually std::text)
and move it between such containers.  This would work similarly to node
extraction for node based containers and would introduce a type similar
to node handle [1].  The intent is to enable something like:

std::vector<char> v = ...;
std::string s(std::move(v.extract()));
std::text t(std::move(s.extract()));

QString and other custom string types would be able to opt-in to the
mechanism.

[1]: http://en.cppreference.com/w/cpp/container/node_handle
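
For comparison, here is a minimal sketch of the existing C++17 node extraction mechanism that the buffer extraction idea is modeled on. It uses only std::map::extract and the node-handle overload of std::map::insert; the helper name transfer_node is hypothetical and is not part of any proposal. A buffer handle for std::vector or std::string would presumably follow the same shape, but no such interface exists today.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: moves the element stored under `key` from `from`
// into `to` without copying the element. Returns true on success.
inline bool transfer_node(std::map<int, std::string>& from,
                          std::map<int, std::string>& to,
                          int key)
{
    // extract() unlinks the node and returns a move-only node handle.
    auto handle = from.extract(key);
    if (handle.empty()) {
        return false;  // `from` had no element with this key
    }

    // insert() adopts the node; on a key collision the handle comes back
    // in result.node, so it is returned to the source container.
    auto result = to.insert(std::move(handle));
    if (!result.inserted) {
        from.insert(std::move(result.node));
    }
    return result.inserted;
}
```

The element keeps its address across the transfer, which is the property a buffer handoff between std::vector, std::string, and std::text would need to preserve.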

>
>> Because that enables wrapping network and file based I/O without
>> requiring additional storage or conversions. These are real use cases.
>> Perhaps you just haven't had a need for them?
> In my design network and file I/O are handled by
> std::code_unit_sequence[_view] because it is a byte level so byte level
> classes should handle it.

I don't see a convincing argument here and I'm still struggling to
understand your perspective.  In my opinion, having the endian concerns
and byte level access wrapped in an encoding class that has a well known
name is quite convenient.  Is your concern primarily philosophical? 
Perhaps based on separation of concerns?

I can see uses for a codec that deals only with the endian concerns. 
But why force programmers concerned with working with text to explicitly
interact with such a type?
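
As a rough illustration of what an endian-only codec might look like, the following sketch turns a sequence of octets into UTF-16 code units and does nothing else. The names byte_order and decode_utf16_code_units are hypothetical and do not come from text_view or from any proposal; the sketch also assumes that each input element holds exactly one octet.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tag for the serialized byte order of the input.
enum class byte_order { big, little };

// Hypothetical endian-only codec: combines pairs of octets into UTF-16
// code units. It does not validate surrogates or decode code points.
// A trailing unpaired octet is ignored.
inline std::u16string decode_utf16_code_units(
    const std::vector<unsigned char>& octets, byte_order order)
{
    std::u16string units;
    units.reserve(octets.size() / 2);
    for (std::size_t i = 0; i + 1 < octets.size(); i += 2) {
        const unsigned first = octets[i] & 0xFFu;
        const unsigned second = octets[i + 1] & 0xFFu;
        const unsigned value = (order == byte_order::big)
                                   ? (first << 8) | second
                                   : (second << 8) | first;
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

A named encoding such as UTF-16BE can wrap a step like this internally, so code that only cares about text never has to call it directly.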

>
>> I wonder if there is some disconnect between what text_view provides and
>> what you think it provides. It would be helpful if you were to provide
>> some example code that we could use to clarify discussion; something
>> that would allow side-by-side comparisons of various interfaces.
> I have started to implement code_unit_sequence and will report my findings.
>
>>> Since we are aiming for a standard library, it is assumed that
>>> implementers know the value of std::endian::native.
>> That doesn't isolate programmers that use the standard library from
>> being impacted.
> That's why code_unit_sequence::data() returns std::byte* (should it be
> std::span<std::byte>?) so programmers can pass those blobs of bytes
> anywhere they want. The endianness conversions should be handled by the
> standard library.

Is it your intention that, given u"text" (UTF-16), it should be
possible to obtain a std::byte pointer to the underlying code units
(e.g., a sequence of bytes in either BE or LE order)?  If so, that isn't
possible because some systems have 16-bit (or larger) bytes where
sizeof(char16_t)==1; the individual octets of char16_t are not
addressable on such systems.
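
To make the point concrete, here is a small sketch that inspects the object representation of a single char16_t. The helper object_bytes and the constant byte_is_octet are hypothetical names used only for illustration. On a platform where CHAR_BIT is 8 the helper typically yields two bytes in native order; where CHAR_BIT is 16 or more and sizeof(char16_t) is 1 it yields a single byte, and there is no pointer that designates either octet alone.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <vector>

// char16_t can always hold a 16-bit code unit, whatever the byte width is.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t holds at least 16 bits");

// True only when a byte (the unit of sizeof) is exactly one octet.
constexpr bool byte_is_octet = (CHAR_BIT == 8);

// Hypothetical helper: copies the object representation of one code unit.
// The result has sizeof(char16_t) elements, which is not necessarily 2.
inline std::vector<unsigned char> object_bytes(char16_t unit)
{
    std::vector<unsigned char> bytes(sizeof unit);
    std::memcpy(bytes.data(), &unit, sizeof unit);
    return bytes;
}
```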

Tom.


SG16 list run by herb.sutter at gmail.com