Re: [SG16-Unicode] code_unit_sequence and code_point_sequence

From: R. Martinho Fernandes <rmf_at_[hidden]>
Date: Tue, 19 Jun 2018 10:41:03 +0200
I think I wasn't clear enough. The code_unit_sequence interface can be used as an implementation detail without being part of the interface. There's nothing keeping an implementation from doing things this way and still having only 3 algorithms to implement instead of 7, while at the same time providing an interface supporting 7 encoding schemes.
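
To illustrate, here is a rough sketch of what I mean. All names (byte_order, read_unit16, decode_utf16_bytes) are made up for this example and come from neither text_view nor any proposal. The byte-to-code-unit step stays an internal helper, and a single UTF-16 decoding loop then serves both UTF-16LE and UTF-16BE. I am assuming here that unpaired surrogates become U+FFFD and that a trailing odd byte is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names; not part of text_view or any proposal.
enum class byte_order { little, big };

// Internal detail: assemble the i-th 16-bit code unit from a byte buffer.
inline std::uint16_t read_unit16(const std::vector<std::uint8_t>& bytes,
                                 std::size_t i, byte_order order) {
    const std::uint16_t b0 = bytes[2 * i];
    const std::uint16_t b1 = bytes[2 * i + 1];
    return order == byte_order::little
        ? static_cast<std::uint16_t>(b0 | (b1 << 8))
        : static_cast<std::uint16_t>((b0 << 8) | b1);
}

// The one UTF-16 decoding loop, shared by UTF-16LE and UTF-16BE.
// Assumption: unpaired surrogates are replaced with U+FFFD.
inline std::vector<char32_t> decode_utf16_bytes(
        const std::vector<std::uint8_t>& bytes, byte_order order) {
    std::vector<char32_t> out;
    const std::size_t n = bytes.size() / 2;  // a trailing odd byte is ignored
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint16_t u = read_unit16(bytes, i, order);
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {
            const std::uint16_t lo = read_unit16(bytes, i + 1, order);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                out.push_back(0x10000
                              + ((static_cast<char32_t>(u) - 0xD800) << 10)
                              + (static_cast<char32_t>(lo) - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        const bool surrogate = (u >= 0xD800 && u <= 0xDFFF);
        out.push_back(surrogate ? static_cast<char32_t>(0xFFFD)
                                : static_cast<char32_t>(u));
    }
    return out;
}
```

The user only ever sees bytes going in and code points coming out; the 16-bit units never surface in the interface.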

I am reasonably convinced that the code_point_view interface is enough; if it can handle encoding forms, it can trivially handle encoding schemes too, as encoding schemes are just a subset of encoding forms and require no special processing.

For a user, however, I struggle to find value in converting bytes to 16-bit units, as those units are completely meaningless until they are decoded (e.g. via code_point_view).

Maybe you know a use case for this that is neither implementation nor transcoding?

Note: the correct model for transcoding is converting from code units to code units; we tried bytes<->code units with the now-deprecated codecvt facets and it didn't work.
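
As a sketch of that code-unit-to-code-unit model: the function below takes UTF-16 code units and produces UTF-8 code units, with no byte-level view of the UTF-16 side at all. The name utf16_to_utf8 is hypothetical, std::string stands in for a sequence of UTF-8 code units, and replacing unpaired surrogates with U+FFFD is my assumption, not something any proposal mandates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: transcode UTF-16 code units into UTF-8 code units.
// Assumption: unpaired surrogates are replaced with U+FFFD.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10)
                 + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        // Emit one to four UTF-8 code units for the code point.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```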

On June 19, 2018 10:11:00 AM GMT+02:00, Lyberta <lyberta_at_[hidden]> wrote:
>R. Martinho Fernandes:
>> I don't think this code_unit_sequence is useful. The focus on
>endianness is misguided, IMO.
>>
>> There's no reason to convert encoding schemes to encoding forms
>(transcoding notwithstanding). The encoding forms from the Unicode
>standard that we need to support are UTF-8, UTF-16, UTF-16LE, UTF-16BE,
>UTF-32, UTF-32LE, UTF-32BE (and possibly forms with BOMs but that needs
>a different design because it's effectively a stateful encoding, so
>let's leave it out for now).
>>
>> There's no byte level. The lowest level that is useful is code units.
>It just happens that some encodings (e.g. UTF-8, UTF-16LE) have bytes
>as code units. Everything is code units.
>>
>> This code_unit_sequence *might* be useful as an implementation
>detail, but not so much as a user interface. All it does is abstract
>away endianness when that is already abstracted by the encoding schemes
>themselves, like UTF-16LE.
>>
>
>I think having UTF-8, UTF-16, UTF-32 and UTF-16LE, UTF-16BE, UTF-32LE,
>UTF-32BE on the same level is not useful since the latter ones return
>bytes and bytes as code units in UTF-16 and UTF-32 would complicate
>interface and implementation.
>
>Instead of writing 3 code point iteration algorithms you would need to
>implement 7.
>
>The text_view paper proposes std::utf16be_encoding and others, while I think
>it should be std::utf16 and std::endian.

Received on 2018-06-19 10:41:09