sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Martinho Fernandes <rmf_at_[hidden]>
Date: Tue, 19 Jun 2018 14:16:40 +0200

On 19.06.18 13:59, Lyberta wrote:
> Example:
>
> code_point_sequence<utf32be> s = ...
> for (const auto code_unit : s)
> {
> std::cout << code_unit << '\n';
> }
>
> error: no match for 'operator<<' (operand types are 'std::ostream' {aka
> 'std::basic_ostream<char>'} and 'std::byte')
>
> A user would expect to see an integer in the range 0-10FFFF but all they
> would get is a compile-time error.

I'm sorry, but this example doesn't make sense. I don't expect iteration
over a code_point_sequence to produce code units, and I don't think
anyone should expect that. The whole point of such types is to hide the
code units from the interface.
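
To make that concrete, here is a rough sketch of what I would expect
iteration to look like. The class name, the storage layout, and the
assert-based precondition are all made up for illustration; this is not a
type anyone has proposed in this thread. The bytes stay private, and
dereferencing the iterator yields a whole code point as char32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration only; not a type proposed in this thread.
class utf32be_code_point_sequence {
public:
    // Precondition: the storage holds a whole number of 4-byte code points.
    explicit utf32be_code_point_sequence(std::vector<std::uint8_t> bytes)
        : bytes_(std::move(bytes)) {
        assert(bytes_.size() % 4 == 0);
    }

    class iterator {
    public:
        explicit iterator(const std::uint8_t* p) : p_(p) {}

        // Assemble one code point from four big-endian bytes.
        char32_t operator*() const {
            return static_cast<char32_t>(
                (static_cast<std::uint32_t>(p_[0]) << 24) |
                (static_cast<std::uint32_t>(p_[1]) << 16) |
                (static_cast<std::uint32_t>(p_[2]) << 8) |
                static_cast<std::uint32_t>(p_[3]));
        }
        iterator& operator++() { p_ += 4; return *this; }
        bool operator!=(const iterator& other) const { return p_ != other.p_; }

    private:
        const std::uint8_t* p_;
    };

    // Iteration exposes code points; the underlying bytes never leak out.
    iterator begin() const { return iterator(bytes_.data()); }
    iterator end() const { return iterator(bytes_.data() + bytes_.size()); }

private:
    std::vector<std::uint8_t> bytes_;
};
```

With something like this, "for (const auto code_point : s)" gives char32_t
values, and casting each one to std::uint32_t before streaming prints the
numeric value the user was expecting.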

> Unicode is hard, very hard. The only way to make people's code correct
> is to apply hard rigor. Encoding form and encoding scheme are
> fundamentally different things. If we start mixing them, it will
> eventually break in unforeseen ways.

They are definitely not fundamentally different. As I explained, any
encoding scheme without a BOM fits the definition of an encoding form with
byte-sized integers as code units. Byte-sized integer code units are
not a crazy idea either, since that is exactly what UTF-8 has. Any
design that supports encoding forms in general will therefore also work
with BOM-free encoding schemes.
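
As a minimal sketch of what I mean, here is UTF-16BE decoded as if it were
an encoding form whose code units are bytes. The names code_unit and
decode_utf16be are hypothetical, the even-length precondition is a
simplification, and replacing unpaired surrogates with U+FFFD is just one
possible error policy. The signature has the same shape a UTF-8 decoder
would have: byte-sized code units in, code points out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical name: a byte-sized code unit, as UTF-8 already has.
using code_unit = std::uint8_t;

// Sketch: decode UTF-16BE, treating every byte as one code unit.
// Precondition (simplification): the input has an even number of bytes.
std::vector<char32_t> decode_utf16be(const std::vector<code_unit>& units) {
    assert(units.size() % 2 == 0);
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < units.size()) {
        // Two byte-sized code units form one 16-bit value, big end first.
        const std::uint32_t first =
            (static_cast<std::uint32_t>(units[i]) << 8) | units[i + 1];
        i += 2;
        if (first >= 0xD800 && first <= 0xDBFF) {
            // High surrogate: needs a following low surrogate.
            if (i < units.size()) {
                const std::uint32_t second =
                    (static_cast<std::uint32_t>(units[i]) << 8) | units[i + 1];
                if (second >= 0xDC00 && second <= 0xDFFF) {
                    i += 2;
                    out.push_back(static_cast<char32_t>(
                        0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)));
                    continue;
                }
            }
            out.push_back(U'\uFFFD');  // unpaired high surrogate
        } else if (first >= 0xDC00 && first <= 0xDFFF) {
            out.push_back(U'\uFFFD');  // unpaired low surrogate
        } else {
            out.push_back(static_cast<char32_t>(first));  // BMP code point
        }
    }
    return out;
}
```

Nothing in the decoder needs to know that the bytes came from an "encoding
scheme" rather than an "encoding form"; the byte order is simply part of
how the code units map to code points.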

> If you want to suggest that encoding schemes should use "char" as their
> code unit type, please don't. Char is the most dreaded part of C++ for
> me. It has unknown signedness, it breaks strict aliasing, which leads to
> slow code, and, worst of all, people tend to think that "char" means
> "character". I get frustrated every time I use "char" in my code and I
> was leaping in joy when std::byte was standardized. Please, let char die
> its well-deserved death.

I do not think I suggested that; if it came across that way, it was not
my intent. Any appropriate byte-sized integer type will do.

> This would be a deserialization with "batteries included". I'm talking
> about the case where we want to quickly read bytes and store them in
> code_unit_sequence until further notice.

But what does one do with a code_unit_sequence? It seems like an
unnecessary intermediate step to me. Can you show an example of using
such a code_unit_sequence after deserialization outside of the
implementation of code_point_sequence?

-- 
Martinho

Received on 2018-06-19 14:18:05