
Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 19 Jun 2018 11:59:00 +0000
Martinho Fernandes:
> Can you explain how the user code becomes more complicated? Perhaps with
> examples?

Because people don't expect bytes as code units in UTF-16 and UTF-32.
std::basic_string was designed with a heavy skew towards ASCII and
Unicode 1.0, on the assumption that "char" and "wchar_t" were meaningful
units of text. We all know how that crashed and burned.

If encoding schemes use bytes as code units, their code unit type should
be std::byte, and std::byte is not an integer type, which will lead to
countless compile errors and tons of the same questions on StackOverflow
and elsewhere.


code_point_sequence<utf32be> s = ...;
for (const auto code_unit : s)
    std::cout << code_unit << '\n';

error: no match for 'operator<<' (operand types are 'std::ostream' {aka
'std::basic_ostream<char>'} and 'std::byte')

A user would expect to see an integer in the range 0-10FFFF, but all
they would get is a compile-time error.

>> text_view and
>> code_point_sequence shouldn't take encoding schemes as template
>> parameters, only encoding forms. Essentially, TextEncoding is as
>> horrible as std::basic_string in its design.
> Can you explain why it shouldn't take encoding schemes? There is no
> explanation here, and it isn't clear to me why not.

Unicode is hard, very hard. The only way to make people's code correct
is to apply strict rigor. Encoding form and encoding scheme are
fundamentally different things; if we start mixing them, it will
eventually break in unforeseen ways.
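To make the distinction concrete, a small sketch (the function names are
mine, purely illustrative, not from any proposal): an encoding form maps
a code point to code units, while an encoding scheme additionally fixes
how those code units are serialized as bytes:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Encoding form: UTF-32 yields one 32-bit code unit per code point.
constexpr std::uint32_t utf32_encode(char32_t cp) {
    return static_cast<std::uint32_t>(cp);
}

// Encoding scheme: UTF-32BE serializes that code unit as four bytes,
// most significant byte first.
constexpr std::array<std::byte, 4> utf32be_serialize(std::uint32_t u) {
    return {std::byte((u >> 24) & 0xFF), std::byte((u >> 16) & 0xFF),
            std::byte((u >> 8) & 0xFF),  std::byte(u & 0xFF)};
}
```

The first function is meaningful on any platform; the second only makes
sense at a serialization boundary, which is why conflating the two in
one template parameter mixes layers.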

If you want to suggest that encoding schemes should use "char" as their
code unit type, please don't. char is the most dreaded part of C++ for
me. It has implementation-defined signedness, it can alias anything,
which defeats strict-aliasing optimizations and leads to slow code, and,
worst of all, people tend to think that "char" means "character". I get
frustrated every time I use "char" in my code, and I leapt for joy when
std::byte was standardized. Please, let char die its well-deserved death.

Steve Downey:
> I would think that deserialization would be an operation on a Range of
> std::byte or int8_t, where you would read out code points depending on the
> encoding. Possibly with either replacement or failure. But until you have
> code points, it's not text, it's raw octets. [Are we still supporting the
> hypothetical 9 bit byte computer in the standard?]

This would be deserialization with "batteries included". I'm talking
about the case where we want to quickly read bytes and store them in a
code_unit_sequence until further notice.

Higher layers are handled by code_point_sequence and text.
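A sketch of the "batteries not included" layer I mean, with a plain
std::vector<std::byte> standing in for the hypothetical
code_unit_sequence: read raw bytes without decoding or validating them.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Read a whole file as raw bytes; no decoding, no validation.
// The result plays the role of a code_unit_sequence over std::byte.
std::vector<std::byte> read_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::byte> bytes;
    char c;
    while (in.get(c))
        bytes.push_back(static_cast<std::byte>(c));
    return bytes;
}
```

Decoding those bytes into code points (with replacement or failure, as
Steve describes) is then a separate operation at the next layer up.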

There is nothing wrong with supporting systems where CHAR_BIT != 8.
Clang was recently ported to a system where CHAR_BIT == 16, although I
don't think people will do text processing on such systems. In any case,
we just use std::byte in the wording.

Received on 2018-06-19 13:59:53