sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Lyberta <lyberta_at_[hidden]>
Date: Wed, 20 Jun 2018 12:53:00 +0000

Martinho Fernandes:
> On 20.06.18 11:47, Lyberta wrote:
>> In my proposal std::text takes something that satisfies
>> CodePointSequence and std::text<std::vector<char32_t>> will compile and
>> work as expected, just no BOM and endianness handling inside std::vector.
>
> So how do I get "endianness handling" if my data is in a vector?

In my serialization library, binary streams have an endianness that can
be changed at runtime. A serialization function iterates over the code
units, reinterprets them as sequences of bytes, swaps the bytes if
necessary, and appends them to the stream. std::code_unit_sequence does
this in place, so serialization becomes essentially just std::copy.
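
A minimal sketch of the serialization step described above. Every name
in it (byte_order, byte_stream, write_code_units) is hypothetical and is
not taken from the actual library or the proposal; it also assumes 8-bit
bytes and code units no wider than 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names here are hypothetical; none come from a real library.
enum class byte_order { little, big };

// Toy binary stream whose byte order can be changed at run time.
struct byte_stream {
    byte_order order = byte_order::little;
    std::vector<std::uint8_t> bytes;
};

// Append each code unit to the stream byte by byte, in the stream's
// current byte order. Shifts are used instead of reinterpret_cast so
// the output does not depend on the host's own endianness.
template <typename CodeUnit>
void write_code_units(byte_stream& out, const std::vector<CodeUnit>& units)
{
    static_assert(sizeof(CodeUnit) <= 4,
                  "code units wider than 32 bits are not handled");
    constexpr std::size_t n = sizeof(CodeUnit);
    for (CodeUnit unit : units) {
        const auto value = static_cast<std::uint32_t>(unit);
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t shift =
                8 * (out.order == byte_order::little ? i : n - 1 - i);
            out.bytes.push_back(static_cast<std::uint8_t>(value >> shift));
        }
    }
}

// Usage: U+20AC as one UTF-16 code unit, written big-endian.
inline void example_utf16be()
{
    byte_stream s;
    s.order = byte_order::big;
    write_code_units(s, std::vector<char16_t>{0x20AC});
    assert(s.bytes.size() == 2 && s.bytes[0] == 0x20 && s.bytes[1] == 0xAC);
}
```

With a container that already stores bytes in the target order, the
inner loop disappears and the whole function reduces to a plain copy.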

> The real question is: why do I need to ask that
> question at all? You'll notice that simpler designs simply don't have
> this problem.

I have never seen a good Unicode library in C++. ICU is a joke and other
ones seem to provide just tiny bits of what a proper Unicode library
would provide.

> They can work with any source and do UTF-16BE/LE just
> fine, with the same interface that handles any other encoding, nothing
> special. The interface that handles one case handles all the cases, all
> in the same fashion.

std::text handles everything by just taking a CodePointSequence. It
doesn't need to know the encoding form or the endianness of the sequence
because its iterators return something convertible to char32_t, and that
is all that is needed.
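
A rough illustration of that idea, assuming a much looser requirement
than the proposal's CodePointSequence: the hypothetical function below
accepts any range whose elements convert to char32_t and never asks how
the range is stored.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the CodePointSequence requirement: any
// range whose elements are convertible to char32_t is accepted.
template <typename Range>
std::size_t count_non_ascii(const Range& code_points)
{
    using element = decltype(*std::begin(code_points));
    static_assert(std::is_convertible<element, char32_t>::value,
                  "elements must be convertible to char32_t");
    std::size_t count = 0;
    for (char32_t cp : code_points) {
        if (cp > 0x7F) ++count;  // anything above U+007F is non-ASCII
    }
    return count;
}

// Usage: the same function works on two unrelated containers.
inline void example_any_container()
{
    const std::vector<char32_t> v{0x61, 0xE9, 0x1F600};
    const std::u32string s = U"abc";
    assert(count_non_ascii(v) == 2);
    assert(count_non_ascii(s) == 0);
}
```

The convertibility check alone would also let a range of plain char
through, so a real requirement would need to be stricter than this.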

> With all due respect, I think that the proposal really needs to get
> *proper use cases* sorted out first. All this time we've been asking
> "what do I use this for" and getting poorly thought-out examples that
> actually demonstrate flaws instead of demonstrating usage (why is it
> possible to have "big endian utf-8" and "little endian utf-8" as
> separate types at all?).

Because in the general case encoding form and endianness are
independent, unless someone proves that all encodings other than UTF-16
and UTF-32 work in terms of bytes.

> I don't need to see an implementation; I alone have implemented this
> sort of thing four times over already, and I know for a fact that others
> on this list have done the same. I (we?) trivially believe this can be
> implemented because it *has been* implemented.

Can you show me the code?

> The thing that trips me is that I still don't know what kind of usage
> this enables that a simpler design wouldn't enable. A simpler design
> would be one that doesn't have three specialized containers, one that
> doesn't have a "bytes to code units" adapter of dubious value, one that
> doesn't leak byte order concerns everywhere, one that isn't built on the
> assumption that we want basic_string to be removed.

Can a simpler design work *on top* of QString? wxString? A CString
wrapper? The core of my design is to support the countless containers
that satisfy CodeUnitSequence and CodePointSequence. We will do a massive
disservice to a lot of people if we require copying the underlying
buffer. ICU did this and it is the main reason why it's so bad. People
will not migrate from their containers, but they would love proper
Unicode support.
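
A sketch of what working on top of an existing container without copying
could look like. utf16_code_point_view is a hypothetical name; it assumes
the wrapped container holds UTF-16 code units that can be cast to
char16_t, so a type such as QString would need a small adapter first.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical non-owning adapter: keeps a pointer to the caller's
// UTF-16 container and decodes code points on demand, with no copy.
template <typename Utf16Container>
class utf16_code_point_view {
public:
    explicit utf16_code_point_view(const Utf16Container& units)
        : units_(&units) {}

    // Calls f(char32_t) once per decoded code point.
    // Unpaired surrogates are reported as U+FFFD.
    template <typename F>
    void for_each(F f) const
    {
        auto it = std::begin(*units_);
        const auto last = std::end(*units_);
        while (it != last) {
            const char32_t lead = static_cast<char16_t>(*it);
            ++it;
            if (lead >= 0xD800 && lead <= 0xDBFF && it != last) {
                const char32_t trail = static_cast<char16_t>(*it);
                if (trail >= 0xDC00 && trail <= 0xDFFF) {
                    ++it;  // consume the trail surrogate as well
                    f(static_cast<char32_t>(
                        0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)));
                    continue;
                }
            }
            f(lead >= 0xD800 && lead <= 0xDFFF ? char32_t{0xFFFD} : lead);
        }
    }

private:
    const Utf16Container* units_;  // not owned; must outlive the view
};

// Usage: decode "A" followed by U+1F600 straight from the caller's string.
inline void example_no_copy()
{
    const std::u16string s = u"A\U0001F600";
    std::vector<char32_t> out;
    utf16_code_point_view<std::u16string>(s).for_each(
        [&out](char32_t cp) { out.push_back(cp); });
    assert(out.size() == 2 && out[0] == 0x41 && out[1] == 0x1F600);
}
```

The view stores only a pointer, so the caller's buffer stays where it is
and must outlive the view.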

I don't mind if my version of std::code_unit_sequence is not
standardized and we instead continue using std::basic_string under the
hood, but we wrap it inside code point and grapheme cluster layers.

Received on 2018-06-20 14:53:16