Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 19 Jun 2018 14:01:29 -0400
On 06/19/2018 04:11 AM, Lyberta wrote:
> R. Martinho Fernandes:
>> I don't think this code_unit_sequence is useful. The focus on endianness is misguided, IMO.
>> There's no reason to convert encoding schemes to encoding forms (transcoding notwithstanding). The encoding forms from the Unicode standard that we need to support are UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32LE, UTF-32BE (and possibly forms with BOMs but that needs a different design because it's effectively a stateful encoding, so let's leave it out for now).
>> There's no byte level. The lowest level that is useful is code units. It just happens that some encodings (e.g. UTF-8, UTF-16LE) have bytes as code units. Everything is code units.
>> This code_unit_sequence *might* be useful as an implementation detail, but not so much as a user interface. All it does is abstract away endianness when that is already abstract by the encoding schemes themselves, like UTF-16LE.
> I think having UTF-8, UTF-16, UTF-32 and UTF-16LE, UTF-16BE, UTF-32LE,
> UTF-32BE on the same level is not useful, since the latter ones return
> bytes, and bytes as code units in UTF-16 and UTF-32 would complicate
> the interface and implementation.

UTF-16 and UTF-32 are convenient for views over u"text" and U"text"
respectively, and the BE/LE variants are useful as views over
(byte-oriented) network and file I/O, without having to first convert
from encoding scheme to encoding form.
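As a rough illustration of what such a view does under the hood, here is a minimal sketch of decoding one code point directly from UTF-16BE bytes, with no intermediate conversion to native-endian code units. The function name is mine and well-formed input is assumed; this is not the Text_view API.

```cpp
#include <cassert>
#include <cstddef>

// Decode one code point from UTF-16BE bytes at p[i], advancing i.
// Assumes well-formed input (no validation) -- a sketch, not a spec.
char32_t decode_utf16be(const unsigned char* p, std::size_t& i) {
    char16_t lead = static_cast<char16_t>((p[i] << 8) | p[i + 1]);
    i += 2;
    if (lead >= 0xD800 && lead <= 0xDBFF) {  // high surrogate
        char16_t trail = static_cast<char16_t>((p[i] << 8) | p[i + 1]);
        i += 2;
        return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                       + (char32_t(trail) - 0xDC00);
    }
    return lead;  // BMP code point
}
```

Note that the byte-to-code-unit step and the code-unit-to-code-point step happen in one pass; no separate native-endian buffer is ever materialized.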

Following the thread further, it seems you would like to have a simple
codec for translating BE/LE data (e.g., to load BE/LE byte oriented data
into native endian larger-than-byte types). That sounds reasonable, but
I don't see why it should be part of text interfaces.
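A minimal sketch of such a standalone codec, assuming it is a plain byte-order utility with no text-specific interface (the name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Widen a big-endian byte stream into native-endian 16-bit units.
// A generic byte-order utility; nothing here knows about text.
std::vector<std::uint16_t> widen_be16(const unsigned char* p,
                                      std::size_t n_bytes) {
    std::vector<std::uint16_t> out;
    out.reserve(n_bytes / 2);
    for (std::size_t i = 0; i + 1 < n_bytes; i += 2)
        out.push_back(static_cast<std::uint16_t>((p[i] << 8) | p[i + 1]));
    return out;
}
```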

> Instead of writing 3 code point iteration algorithms you would need to
> implement 7.

Martinho already made this point, but I'll repeat it: some of the
variants can be implemented using the others. Text_view doesn't do so
for these cases (it does for other cases, for example, for stateful BOM
handling), mainly because it doesn't save much. These are not
complicated algorithms.
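For instance, the LE variant's code-unit load can be expressed in terms of the BE one plus a byte swap, with everything above the code-unit level shared. A quick sketch with made-up names, not Text_view's:

```cpp
#include <cassert>

// Load a UTF-16BE code unit from two bytes into native order.
char16_t load_utf16be(const unsigned char* p) {
    return static_cast<char16_t>((p[0] << 8) | p[1]);
}

// Swap the two bytes of a 16-bit code unit.
char16_t byteswap16(char16_t u) {
    return static_cast<char16_t>(((u & 0x00FF) << 8) | ((u & 0xFF00) >> 8));
}

// The LE load is the BE load followed by a swap; the surrogate-pair
// logic that sits above either load is identical for both variants.
char16_t load_utf16le(const unsigned char* p) {
    return byteswap16(load_utf16be(p));
}
```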

> The Text_view paper proposes std::utf16be_encoding and others, while I
> think it should be std::utf16 and std::endian.

Text_view precedes the introduction of std::endian. Regardless, I don't
think I would want std::endian to be part of the encoding signatures;
that seems clumsy. Note that endian::native may or may not equal either
of endian::big or endian::little, so writing class template
specializations would get ugly. I'd rather use different names than
different specializations to handle the BE/LE differences.


Received on 2018-06-19 20:11:12