Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Martinho Fernandes <rmf_at_[hidden]>
Date: Wed, 20 Jun 2018 15:52:28 +0200
On 20.06.18 14:53, Lyberta wrote:
>> With all due respect, I think that the proposal really needs to get
>> *proper use cases* sorted out first. All this time we've been asking
>> "what do I use this for" and getting poorly thought-out examples that
>> actually demonstrate flaws instead of demonstrating usage (why is it
>> possible to have "big endian utf-8" and "little endian utf-8" as
>> separate types at all?).
>
> Because in the general case encoding form and endianness are independent
> unless someone will prove that all other encodings except UTF-16 and
> UTF-32 work in terms of bytes.

I still don't see the case for treating byte order as an orthogonal
concern given the existence of e.g. UTF-8. But since you asked: yes, the
majority of encodings have byte-sized code units. The only important
exceptions that I can think of are UTF-16 and UTF-32, plus the legacy
UCS-2 and UCS-4, which have been superseded by UTF-16 and UTF-32
respectively. All the truly relevant encodings have bytes as code units.
Roughly in order of prevalence, these are: UTF-8, the ISO-8859 family,
Windows-1252, Windows-1251, Shift-JIS, GB2312, GBK, GB18030, the EUC
family, the KOI8 family, and TIS-620. Big5 is the only one with some use
that has, at least on paper, double-byte code units. However, Big5 was
expressly designed to coexist with a single-byte (7-bit) encoding, and
in practice it coexists with US-ASCII, so it cannot be reasonably
treated as a double-byte encoding.
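
To illustrate the point about byte order, here is a minimal sketch. The
names byte_order, utf16_units_from_bytes and utf8_units_from_bytes are
hypothetical and do not come from any proposal or library. It shows that
byte order only comes into play when 16-bit code units are assembled
from a byte stream, and that the byte-sized case has nothing to swap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class byte_order { big, little };

// Assemble 16-bit code units from serialized bytes. This is the only
// place where byte order matters; a trailing odd byte is ignored.
std::vector<char16_t> utf16_units_from_bytes(
        const std::vector<std::uint8_t>& bytes, byte_order order) {
    std::vector<char16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t hi =
            order == byte_order::big ? bytes[i] : bytes[i + 1];
        const std::uint16_t lo =
            order == byte_order::big ? bytes[i + 1] : bytes[i];
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}

// With byte-sized code units (UTF-8, ISO-8859, ...) each byte already is
// a code unit, so there is no byte order parameter to pass.
std::vector<char> utf8_units_from_bytes(
        const std::vector<std::uint8_t>& bytes) {
    return std::vector<char>(bytes.begin(), bytes.end());
}
```

Under this reading, byte order is a property of how 16-bit and 32-bit
code units are serialized, not something every encoding needs to carry
around.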

>> The thing that trips me is that I still don't know what kind of usage
>> this enables that a simpler design wouldn't enable. A simpler design
>> would be one that doesn't have three specialized containers, one that
>> doesn't have a "bytes to code units" adapter of dubious value, one that
>> doesn't leak byte order concerns everywhere, one that isn't built on the
>> assumption that we want basic_string to be removed.
>
> Can a simpler design work *on top* of QString? wxString? CString
> wrapper?

They can, and they do, kinks and all; you don't get a smaller feature
set for using these. Tom's text_view treats all sources the same. My
six-year-old implementation
(https://github.com/libogonek/ogonek/tree/devel/include/ogonek) also
treats all sources the same. Neither of these special-cases byte order,
and neither has byte-to-code-unit adapters.
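
As a rough sketch of what "treats all sources the same" can look like
(decode_utf16 is a hypothetical helper, not the API of text_view or
ogonek), a decoder can be written against any range of 16-bit code
units. The container, be it std::u16string, a vector, or a wrapper
around some other string class, does not matter to it, and no byte
order appears anywhere.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only. Decodes UTF-16 code units
// from any iterator range into code points. Unpaired surrogates are
// replaced with U+FFFD.
template <typename Iterator>
std::u32string decode_utf16(Iterator first, Iterator last) {
    std::u32string out;
    while (first != last) {
        char32_t unit = static_cast<char16_t>(*first++);
        if (unit >= 0xD800 && unit <= 0xDBFF && first != last) {
            const char32_t next = static_cast<char16_t>(*first);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++first;  // consume the low surrogate
                out.push_back(0x10000 + ((unit - 0xD800) << 10)
                                      + (next - 0xDC00));
                continue;
            }
        }
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            unit = 0xFFFD;  // unpaired surrogate
        }
        out.push_back(unit);
    }
    return out;
}

// Convenience overload for anything that std::begin/std::end accept.
template <typename Range>
std::u32string decode_utf16(const Range& range) {
    using std::begin;
    using std::end;
    return decode_utf16(begin(range), end(range));
}
```

The same function accepts a std::u16string and a
std::vector<unsigned short> alike; a string class from another library
would only need to expose iterators over its 16-bit code units.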

> I don't mind my version of std::code_unit_sequence not being
> standardized and instead we continue using std::basic_string under the
> hood, but we wrap it inside a code point layer and grapheme cluster
> layers.

If there isn't a strong motivating example for doing otherwise, this is
the approach most likely to be voted on and approved, and it's what
efforts should be focused on.

-- 
Martinho

Received on 2018-06-20 15:53:55