sg16: Re: [SG16-Unicode] Strong code unit types

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 5 Dec 2018 22:42:46 -0500

On 12/5/18 11:35 AM, Lyberta wrote:
>> Distinct codeunit types are probably not worth the effort.
> Actually, I think adding char8_t, char16_t and char32_t to the language
> was a mistake.
>
> C doesn't have overloading so it can easily work with uint_least8_t,
> uint_least16_t and uint_least32_t for code units.
In C, char16_t and char32_t are typedefs of uint_least16_t and
uint_least32_t respectively.
>
> C++ should have added strong types for code units that would have
> respective C types as private members and overload on them. Literals in
> C++ should produce arrays of strong types.
Arguably, C++ did this (except for u8 literals until now). The C++ code
unit types have underlying types equivalent to C's types.
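
To make that concrete, here is a small sketch (assuming a C++17 compiler; the function name code_unit_type_of is made up for illustration). It checks that char16_t and char32_t are distinct types with the same size as the C typedef targets, and that this lets overload resolution tell the literals apart:

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// char16_t and char32_t are distinct types in C++, not typedefs as in C.
static_assert(!std::is_same_v<char16_t, std::uint_least16_t>);
static_assert(!std::is_same_v<char32_t, std::uint_least32_t>);

// They share size and signedness with their underlying types.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t));
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t));
static_assert(std::is_unsigned_v<char16_t> && std::is_unsigned_v<char32_t>);

// u"" and U"" literals are arrays of the distinct types.
static_assert(std::is_same_v<decltype(u"x"), const char16_t(&)[2]>);
static_assert(std::is_same_v<decltype(U"x"), const char32_t(&)[2]>);

// Hypothetical overload set: possible only because the types are distinct.
inline std::string code_unit_type_of(const char16_t*) { return "char16_t"; }
inline std::string code_unit_type_of(const char32_t*) { return "char32_t"; }
```

With C-style typedefs, the two overloads could collide with overloads on the integer types; with the distinct types they cannot.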
>
> Strong types for code units are important to ban implicit conversions of
> code unit and code point sequences that use the same fundamental types
> under the hood but different encodings.
I agree with this goal, but I think it is achievable by ensuring a
distinct type for code points (for each character set).
>
> We can add wtf8_code_unit and cesu8_code_unit later for WTF-8, CESU-8 and
> all other needed encodings.
Indeed we can.
>
> I have already started to rewrite my Unicode library using strong types
> and I immediately see that my code suddenly became much more
> maintainable. UTF-8 can't have C0, C1 and F5-FF code units, my code was
> littered with checks for these but now I only have 1 check inside
> utf8_code_unit (that fixed some bugs because I missed a few checks
> before). Same for UTF-32, which can't have surrogates or values above 10FFFF.
As mentioned earlier, the validation can have undesirable overhead. I
can appreciate that it simplifies programming by enforcing your desired
invariants though.
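
For readers following along, a minimal sketch of the trade-off might look like the following. The class name utf8_code_unit comes from the quoted message, but everything else (the member names, the use of an exception, the helper is_utf32_scalar_value) is my assumption and not Lyberta's actual code. The single check lives in the constructor, which is also where the per-code-unit cost comes from:

```cpp
#include <stdexcept>

// Hypothetical strong type: one validation point for UTF-8 code units.
class utf8_code_unit {
public:
    // 0xC0, 0xC1 and 0xF5..0xFF never appear in well-formed UTF-8.
    static constexpr bool is_valid(unsigned char v) noexcept {
        return v != 0xC0 && v != 0xC1 && v < 0xF5;
    }

    // The check runs on every construction; this is the overhead at issue.
    explicit utf8_code_unit(unsigned char v) : value_(v) {
        if (!is_valid(v)) {
            throw std::invalid_argument("byte cannot occur in UTF-8");
        }
    }

    unsigned char value() const noexcept { return value_; }

private:
    unsigned char value_;
};

// The analogous UTF-32 invariant: no surrogates, nothing above 0x10FFFF.
constexpr bool is_utf32_scalar_value(char32_t v) noexcept {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}
```

Code that has already validated its input elsewhere pays for the check again on each construction, which is why an unchecked construction path is often wanted alongside a type like this.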
>
> Different encodings can have different code point types. Unicode should
> get unicode_code_point that wraps char32_t. Later we can add other
> encodings such as ASCII so ascii_code_point would wrap char.
This is something I've wanted as well. In text_view, I called this type
a "character" (probably not the best choice of name) and defined it as
the association of a code point value with a character set. See
https://github.com/tahonermann/text_view#concept-character.
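
A rough sketch of the idea, assuming C++17. This is not text_view's actual interface; the names unicode_charset, ascii_charset and the character template are invented here for illustration, while unicode_code_point and ascii_code_point are the names from the quoted message. Tying the code point value to a character set tag gives each character set its own type, so values from different sets cannot be mixed implicitly:

```cpp
#include <type_traits>

// Hypothetical character set tags, each naming its code point storage type.
struct unicode_charset { using code_point_type = char32_t; };
struct ascii_charset   { using code_point_type = char; };

// A code point value associated with a character set.
template <class CharSet>
class character {
public:
    using character_set_type = CharSet;
    using code_point_type = typename CharSet::code_point_type;

    constexpr character() = default;
    constexpr explicit character(code_point_type cp) noexcept : cp_(cp) {}

    constexpr code_point_type code_point() const noexcept { return cp_; }

    friend constexpr bool operator==(const character& a,
                                     const character& b) noexcept {
        return a.cp_ == b.cp_;
    }

private:
    code_point_type cp_{};
};

using unicode_code_point = character<unicode_charset>;
using ascii_code_point   = character<ascii_charset>;

// Distinct types with no implicit conversion between character sets.
static_assert(!std::is_same_v<unicode_code_point, ascii_code_point>);
static_assert(!std::is_convertible_v<ascii_code_point, unicode_code_point>);
static_assert(!std::is_convertible_v<char32_t, unicode_code_point>);
```

Any conversion between the two would then have to be spelled out as an explicit transcoding step.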

Tom.

Received on 2018-12-06 04:42:49