sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Lyberta <lyberta_at_[hidden]>
Date: Wed, 20 Jun 2018 05:52:00 +0000

Tom Honermann:
> Wanting to work with code points and code units is certainly fine and
> often necessary. But what need is served by these new containers that
> isn't already served by existing containers (std::vector, std::string,
> QString, etc...)

There is no container adaptor that takes a sequence of code units and
adapts it as a sequence of code points.
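
To make the missing piece concrete, here is a minimal sketch of such an
adaptor, assuming UTF-8 input. The name utf8_code_point_view and its
interface are hypothetical, not part of any proposal. It decodes lazily,
yields U+FFFD for malformed or truncated sequences, and does not reject
overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical adaptor: presents UTF-8 code units as a read-only range of
// code points. Nothing is copied; decoding happens while iterating.
class utf8_code_point_view {
public:
    explicit utf8_code_point_view(std::string_view units) : units_(units) {}

    class iterator {
    public:
        iterator(std::string_view units, std::size_t pos)
            : units_(units), pos_(pos) { decode(); }

        char32_t operator*() const {
            assert(pos_ < units_.size());  // must not dereference end()
            return cp_;
        }
        iterator& operator++() { pos_ += len_; decode(); return *this; }
        bool operator==(const iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        // Decodes the sequence starting at pos_ into cp_ and records how
        // many code units it occupies in len_.
        void decode() {
            if (pos_ >= units_.size()) { len_ = 0; cp_ = 0; return; }
            const unsigned char lead = static_cast<unsigned char>(units_[pos_]);
            std::size_t need = 0;  // continuation units still expected
            if (lead < 0x80)                { cp_ = lead; }
            else if ((lead & 0xE0) == 0xC0) { cp_ = lead & 0x1F; need = 1; }
            else if ((lead & 0xF0) == 0xE0) { cp_ = lead & 0x0F; need = 2; }
            else if ((lead & 0xF8) == 0xF0) { cp_ = lead & 0x07; need = 3; }
            else { cp_ = 0xFFFD; len_ = 1; return; }  // stray continuation
            len_ = 1;
            for (; need > 0; --need, ++len_) {
                if (pos_ + len_ >= units_.size()) { cp_ = 0xFFFD; return; }
                const unsigned char c =
                    static_cast<unsigned char>(units_[pos_ + len_]);
                if ((c & 0xC0) != 0x80) { cp_ = 0xFFFD; return; }
                cp_ = (cp_ << 6) | (c & 0x3F);
            }
        }

        std::string_view units_;
        std::size_t pos_ = 0;
        std::size_t len_ = 0;
        char32_t cp_ = 0;
    };

    iterator begin() const { return iterator(units_, 0); }
    iterator end() const { return iterator(units_, units_.size()); }

private:
    std::string_view units_;
};
```

With this, "for (char32_t cp : utf8_code_point_view(s))" walks code
points while s stays in whatever container already owns the code units.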

> I agree with this goal and we've been discussing ways to accomplish it.
> I've been intending to write a proposal to make it possible to extract
> the buffer from std::vector and std::string (and eventually std::text)
> and move it between such containers. This would work similarly to node
> extraction for node based containers and would introduce a type similar
> to node handle [1]. The intent is to enable something like:
>
> std::vector<char> v = ...;
> std::string s(std::move(v.extract()));
> std::text t(std::move(s.extract()));
>
> QString and other custom string types would be able to opt-in to the
> mechanism.

This demands something that doesn't exist yet for 3rd party types and
will limit compatibility with them even further.

> I don't see a convincing argument here and I'm still struggling to
> understand your perspective. In my opinion, having the endian concerns
> and byte level access wrapped in an encoding class that has a well known
> name is quite convenient. Is your concern primarily philosophical?
> Perhaps based on separation of concerns?

Yes. The code point layer and the layers above it shouldn't care about
endianness or BOM; only the byte layer should. I have written a
serialization library (already ported to std::endian, btw) and I will
implement endianness and BOM handling in it, so I will have code samples
ready.
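
As a rough illustration of byte-layer handling (not taken from the
library mentioned above), the sketch below turns UTF-16 octets into code
units. The names byte_order and read_utf16_octets are hypothetical;
byte_order is a stand-in for std::endian so the code compiles as C++17.
A leading BOM overrides the caller's fallback order and is consumed, so
the layers above never see it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class byte_order { little, big };  // stand-in for std::endian

// Hypothetical byte-layer helper: converts UTF-16 octets to code units.
// If the input starts with a BOM (FE FF or FF FE), the BOM selects the
// byte order and is dropped; otherwise `fallback` is used.
inline std::vector<char16_t> read_utf16_octets(
    const std::vector<std::uint_least8_t>& octets, byte_order fallback)
{
    assert(octets.size() % 2 == 0);  // UTF-16 needs whole octet pairs

    byte_order order = fallback;
    std::size_t i = 0;
    if (octets.size() >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            order = byte_order::big;
            i = 2;
        } else if (octets[0] == 0xFF && octets[1] == 0xFE) {
            order = byte_order::little;
            i = 2;
        }
    }

    std::vector<char16_t> units;
    units.reserve((octets.size() - i) / 2);
    for (; i + 1 < octets.size(); i += 2) {
        const unsigned hi = order == byte_order::big ? octets[i] : octets[i + 1];
        const unsigned lo = order == byte_order::big ? octets[i + 1] : octets[i];
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}
```

Everything after this function works on char16_t values and has no
notion of byte order or BOM.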

> I can see uses for a codec that deals only with the endian concerns.
> But why force programmers concerned with working with text to explicitly
> interact with such a type?

The idea is that programmers won't need to.

std::text t = u8"Hello";

The type of t will be
std::text<std::code_point_sequence<std::code_unit_sequence<std::utf8,
std::endian::native, std::no_bom>>>;

Here the standard library has chosen native endianness and no reading or
writing of BOM - a sane default. Then we provide helpers such as:

auto t = std::make_text<std::endian::big, std::bom>(u8"Hello");

The type of t will be
std::text<std::code_point_sequence<std::code_unit_sequence<std::utf8,
std::endian::big, std::bom>>>;

Here the programmer has explicitly requested BE with reading and writing
of BOM. std::bom and std::no_bom are just placeholders; this should be
an enum class.
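
A compile-time sketch of how the layering and its defaults could be
spelled with an enum class for the BOM policy. Every name here
(byte_order, bom_policy, utf8, default_text, text_with, and the three
class templates) is a placeholder for illustration, not a proposed
interface. byte_order stands in for std::endian; unlike std::endian it
has a separate "native" value, purely to keep the example short.

```cpp
#include <type_traits>

enum class byte_order { little, big, native };  // stand-in for std::endian
enum class bom_policy { none, read_write };     // instead of no_bom / bom

struct utf8 {};  // encoding tag

// Byte layer: the only place that knows about byte order and BOM.
template <typename Encoding,
          byte_order Order = byte_order::native,
          bom_policy Bom = bom_policy::none>
struct code_unit_sequence {
    using encoding = Encoding;
    static constexpr byte_order order = Order;
    static constexpr bom_policy bom = Bom;
};

// Code point layer: adapts code units, ignores how they were stored.
template <typename CodeUnits>
struct code_point_sequence {
    using code_units = CodeUnits;
};

// Text layer on top of code points.
template <typename CodePoints>
struct text {
    using code_points = CodePoints;
};

// What a plain `std::text t = u8"Hello";` would pick by default.
using default_text = text<code_point_sequence<code_unit_sequence<utf8>>>;

// What a make_text<Order, Bom>(...) helper would return.
template <byte_order Order, bom_policy Bom>
using text_with =
    text<code_point_sequence<code_unit_sequence<utf8, Order, Bom>>>;

// The defaults are native order and no BOM.
static_assert(std::is_same_v<
    default_text, text_with<byte_order::native, bom_policy::none>>);
```

Only code_unit_sequence carries the two policy parameters; the layers
above just forward the type.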

> Is it your intention that, given u"text" (UTF-16), that it should be
> possible to obtain a std::byte pointer to the underlying code units
> (e.g., a sequence of bytes in either BE or LE order)? If so, that isn't
> possible because some systems have 16-bit (or larger) bytes where
> sizeof(char16_t)==1; the individual octets of char16_t are not
> addressable on such systems.

char16_t and char32_t are POD (I know this concept is deprecated), so
doing reinterpret_cast<std::byte*>(&char16_t_variable) is legal and is
necessary for serialization. But I don't know how text handling (or
especially file I/O) works on such systems. It is possible to work with
octets portably by holding them in std::uint_least8_t. It is possible
to convert a char16_t to a sequence of octets by masking and bit
shifting. On a system where CHAR_BIT > 8, char8_t will have unused bits;
on a system where CHAR_BIT > 16, char16_t will have unused bits; on a
system where CHAR_BIT > 32, char32_t will have unused bits. That's not a
problem.
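
The masking and shifting approach can be sketched as follows. The names
byte_order and to_octets are hypothetical, and byte_order is a stand-in
for std::endian. Because the function works on the value of the code
unit rather than its object representation, it does not depend on
CHAR_BIT or on the host byte order.

```cpp
#include <array>
#include <cstdint>

enum class byte_order { little, big };  // stand-in for std::endian

// Hypothetical helper: splits one UTF-16 code unit into two octets in the
// requested order. Each octet is stored in a std::uint_least8_t, which may
// be wider than 8 bits; only the low 8 bits are ever set.
constexpr std::array<std::uint_least8_t, 2> to_octets(char16_t unit,
                                                      byte_order order)
{
    const auto hi = static_cast<std::uint_least8_t>((unit >> 8) & 0xFF);
    const auto lo = static_cast<std::uint_least8_t>(unit & 0xFF);
    return order == byte_order::big
        ? std::array<std::uint_least8_t, 2>{hi, lo}
        : std::array<std::uint_least8_t, 2>{lo, hi};
}

// U+20AC as a UTF-16 code unit is 0x20AC.
static_assert(to_octets(u'\u20AC', byte_order::big)[0] == 0x20);
static_assert(to_octets(u'\u20AC', byte_order::big)[1] == 0xAC);
```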

> Technically, Windows doesn't violate the standard by having a 16-bit
> wchar_t. It violates the standard by using a wide execution character
> set that defines code points that do not fit in its (16-bit) wchar_t
> type. We have an issue (https://github.com/sg16-unicode/sg16/issues/9)
> to track modifying the standard to enable Microsoft's implementation to
> be conforming.

My personal opinion is that wchar_t has failed to achieve its goal and
should be removed from the standard. I guess we will have to wait 20+
years for that.

Received on 2018-06-20 07:52:18