Subject: Re: [SG16-Unicode] code_unit_sequence and code_point_sequence
From: Martinho Fernandes (rmf_at_[hidden])
Date: 2018-06-20 08:52:28

On 20.06.18 14:53, Lyberta wrote:
>> With all due respect, I think that the proposal really needs to get
>> *proper use cases* sorted out first. All this time we've been asking
>> "what do I use this for" and getting poorly thought-out examples that
>> actually demonstrate flaws instead of demonstrating usage (why is it
>> possible to have "big endian utf-8" and "little endian utf-8" as
>> separate types at all?).
> Because in the general case encoding form and endianness are independent
> unless someone will prove that all other encodings except UTF-16 and
> UTF-32 work in terms of bytes.

I still don't see the case for treating byte order as an orthogonal
concern given the existence of e.g. UTF-8, whose code units are single
bytes and therefore have no byte order at all. But since you asked, yes,
the majority of encodings have byte-sized code units. The only important
exceptions that I can think of are UTF-16 and UTF-32, plus the legacy
UCS-2 and UCS-4, which have been superseded by UTF-16 and UTF-32
respectively. All the other truly relevant encodings have bytes as code
units. Roughly in order of prevalence, these are: UTF-8, the ISO-8859
family, Windows-1252, Windows-1251, Shift-JIS, GB2312, GBK, GB18030, the
EUC family, the KOI8 family, and TIS-620. Big5 is the only one with some
use that has, at least on paper, double-byte code units. However, Big5
was expressly designed to coexist with a single-byte (7-bit) encoding,
and in practice it coexists with US-ASCII, so it cannot reasonably be
treated as a double-byte encoding.
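
To make the point concrete, here is a minimal sketch of where byte order
does and does not enter the picture. All names in it (byte_order,
utf16_units_from_bytes, utf8_units_from_bytes) are hypothetical and made
up for illustration; they come from neither the proposal nor any
existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class byte_order { big, little };

// UTF-16 code units are 16 bits wide, so turning a byte stream into
// code units requires choosing a byte order.
std::vector<char16_t> utf16_units_from_bytes(
    const std::vector<std::uint8_t>& bytes, byte_order order) {
    assert(bytes.size() % 2 == 0);  // sketch: no handling of odd lengths
    std::vector<char16_t> units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const std::uint16_t hi =
            order == byte_order::big ? bytes[i] : bytes[i + 1];
        const std::uint16_t lo =
            order == byte_order::big ? bytes[i + 1] : bytes[i];
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}

// UTF-8 code units are bytes, so the "conversion" is the identity and
// there is no byte order to choose.
std::vector<std::uint8_t> utf8_units_from_bytes(
    const std::vector<std::uint8_t>& bytes) {
    return bytes;
}
```

With the bytes 0x20 0xAC, the first function yields the code unit 0x20AC
for big endian and 0xAC20 for little endian, while the second has no
order parameter to get wrong.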

>> The thing that trips me is that I still don't know what kind of usage
>> this enables that a simpler design wouldn't enable. A simpler design
>> would be one that doesn't have three specialized containers, one that
>> doesn't have a "bytes to code units" adapter of dubious value, one that
>> doesn't leak byte order concerns everywhere, one that isn't built on the
>> assumption that we want basic_string to be removed.
> Can a simpler design work *on top* of QString? wxString? CString
> wrapper?

They can and they do, kinks and all; you don't get a smaller feature set
for using these. Tom's text_view treats all sources the same. My
six-year-old implementation
(https://github.com/libogonek/ogonek/tree/devel/include/ogonek) also
treats all sources the same. Neither of them special-cases byte order,
and neither has a bytes-to-code-units adapter.

> I don't mind my version of std::code_unit_sequence not being
> standardized and instead we continue using std::basic_string under the
> hood, but we wrap it inside a code point layer and grapheme cluster
> layers.

If there isn't a strong motivating example for doing otherwise, this is
the approach most likely to be voted on and approved, and it's where
efforts should be focused.
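
As a rough sketch of what such a layer might look like, the function
below decodes code points from any range of UTF-16 code units, be it a
std::u16string or some other container. The name decode_utf16 is
hypothetical, and the code assumes well-formed input and does no
validation, so treat it as an illustration rather than a design.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: decodes code points from any range of UTF-16
// code units. Assumes well-formed input; performs no validation.
template <typename Range>
std::vector<char32_t> decode_utf16(const Range& units) {
    std::vector<char32_t> out;
    auto it = std::begin(units);
    const auto end = std::end(units);
    while (it != end) {
        char32_t u = static_cast<char16_t>(*it++);
        // A high surrogate followed by another unit forms one code point.
        if (u >= 0xD800 && u <= 0xDBFF && it != end) {
            const char32_t low = static_cast<char16_t>(*it++);
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(u);
    }
    return out;
}
```

The same call works unchanged on a std::u16string and on a
std::vector<char16_t>, with the storage staying whatever the caller
already has; nothing in it mentions byte order.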

