Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 19 Jun 2018 08:11:00 +0000
R. Martinho Fernandes:
> I don't think this code_unit_sequence is useful. The focus on endianness is misguided, IMO.
>
> There's no reason to convert encoding schemes to encoding forms (transcoding notwithstanding). The encoding forms from the Unicode standard that we need to support are UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32LE, UTF-32BE (and possibly forms with BOMs but that needs a different design because it's effectively a stateful encoding, so let's leave it out for now).
>
> There's no byte level. The lowest level that is useful is code units. It just happens that some encodings (e.g. UTF-8, UTF-16LE) have bytes as code units. Everything is code units.
>
> This code_unit_sequence *might* be useful as an implementation detail, but not so much as a user interface. All it does is abstract away endianness when that is already abstracted by the encoding schemes themselves, like UTF-16LE.
>

I don't think it is useful to put UTF-8, UTF-16 and UTF-32 on the same
level as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. The latter four
produce bytes, and treating bytes as the code units of UTF-16 and UTF-32
would complicate both the interface and the implementation.

Instead of writing 3 code point iteration algorithms, you would need to
implement 7.

The text_view paper proposes std::utf16be_encoding and others, while I
think it should be std::utf16 combined with std::endian.
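
To make the layering concrete, here is a rough sketch of what I mean, not
a proposed interface. All names in it (byte_order, to_code_units,
decode_utf16) are made up for illustration. byte_order is a stand-in for
std::endian so that the sketch does not depend on C++20. Byte order is
handled once, when bytes are assembled into 16-bit code units, and the
code point algorithm only ever sees code units, so UTF-16LE and UTF-16BE
share a single decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for std::endian (C++20, <bit>).
enum class byte_order { little, big };

// Layer 1: assemble 16-bit code units from bytes in the given byte order.
// A trailing odd byte is ignored in this sketch.
inline std::vector<char16_t> to_code_units(const std::vector<std::uint8_t>& bytes,
                                           byte_order order)
{
    std::vector<char16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint8_t hi = (order == byte_order::big) ? bytes[i] : bytes[i + 1];
        const std::uint8_t lo = (order == byte_order::big) ? bytes[i + 1] : bytes[i];
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}

// Layer 2: the single UTF-16 code point algorithm. It works on code units
// only and knows nothing about byte order.
// Unpaired surrogates are replaced with U+FFFD in this sketch.
inline std::vector<char32_t> decode_utf16(const std::vector<char16_t>& units)
{
    std::vector<char32_t> points;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units.size()
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one supplementary code point.
            const char32_t low = units[i + 1];
            points.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            ++i;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            points.push_back(0xFFFD);
        } else {
            points.push_back(u);
        }
    }
    return points;
}
```

With this split, decode_utf16(to_code_units(bytes, byte_order::big)) and
the same call with byte_order::little go through the same decoder. UTF-32
would need one more decoder, and UTF-8 needs no byte order layer at all,
which is where the count of 3 algorithms comes from.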


Received on 2018-06-19 10:12:12