sg16: Re: [SG16-Unicode] Strong code unit types

From: Lyberta <lyberta_at_[hidden]>
Date: Wed, 05 Dec 2018 16:35:00 +0000

> Distinct codeunit types are probably not worth the effort.

Actually, I think adding char8_t, char16_t and char32_t to the language
was a mistake.

C doesn't have overloading so it can easily work with uint_least8_t,
uint_least16_t and uint_least32_t for code units.

C++ should instead have added strong types for code units, each holding
the corresponding C type as a private member, and overloaded on those
strong types. String literals in C++ should produce arrays of these
strong types.

Strong types for code units are important because they ban implicit
conversions between code unit and code point sequences that use the
same fundamental types under the hood but different encodings.
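
A minimal sketch of the idea, assuming hypothetical class names and a
hypothetical encoding_of() helper (none of this is an existing or
proposed API). Both types wrap the same C type, yet neither converts to
the other, and overload resolution can tell them apart:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical strong type for a UTF-8 code unit. Validation is left out
// here to keep the focus on the type system.
class utf8_code_unit {
public:
    constexpr explicit utf8_code_unit(std::uint_least8_t v) noexcept
        : value_(v) {}
    constexpr std::uint_least8_t value() const noexcept { return value_; }
private:
    std::uint_least8_t value_;  // the C-level representation stays private
};

// Hypothetical strong type for a CESU-8 code unit: same representation,
// different encoding, therefore a different type.
class cesu8_code_unit {
public:
    constexpr explicit cesu8_code_unit(std::uint_least8_t v) noexcept
        : value_(v) {}
    constexpr std::uint_least8_t value() const noexcept { return value_; }
private:
    std::uint_least8_t value_;
};

// Illustrative tag used only by this sketch.
enum class encoding { utf8, cesu8 };

// Overloading on the strong types; this would be ambiguous if both code
// units were plain std::uint_least8_t.
constexpr encoding encoding_of(utf8_code_unit) noexcept { return encoding::utf8; }
constexpr encoding encoding_of(cesu8_code_unit) noexcept { return encoding::cesu8; }

// No implicit conversion between encodings, and none from the raw integer.
static_assert(!std::is_convertible<utf8_code_unit, cesu8_code_unit>::value,
              "UTF-8 code units must not silently become CESU-8 code units");
static_assert(!std::is_convertible<std::uint_least8_t, utf8_code_unit>::value,
              "raw integers must be wrapped explicitly");
```

Crossing from one encoding to the other then requires an explicit,
visible step such as cesu8_code_unit(u.value()).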

We can add wtf8_code_unit and cesu8_code_unit later for WTF-8 and CESU-8,
and do the same for any other encodings we need.

I have already started to rewrite my Unicode library using strong types
and I can immediately see that my code has become much more
maintainable. UTF-8 can't contain the code units 0xC0, 0xC1 and
0xF5-0xFF. My code was littered with checks for these, but now I have
only one check inside utf8_code_unit (which fixed some bugs, because I
had missed a few checks before). The same goes for UTF-32, which can't
contain surrogates or values above 0x10FFFF.
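
As an illustration only (this is not the code from my library; the
is_valid() helper, the exception type and the class layout are
assumptions made for the sketch), centralizing the check could look
like this:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical UTF-8 code unit whose constructor performs the only check.
class utf8_code_unit {
public:
    // 0xC0, 0xC1 and 0xF5-0xFF never occur in well-formed UTF-8.
    // Values above 0xFF (possible if uint_least8_t is wider than 8 bits)
    // are rejected by the same comparison.
    static constexpr bool is_valid(std::uint_least8_t v) noexcept {
        return v != 0xC0 && v != 0xC1 && v < 0xF5;
    }

    // Throws std::invalid_argument for a value that is never valid.
    constexpr explicit utf8_code_unit(std::uint_least8_t v)
        : value_(is_valid(v)
                     ? v
                     : throw std::invalid_argument("invalid UTF-8 code unit")) {}

    constexpr std::uint_least8_t value() const noexcept { return value_; }

private:
    std::uint_least8_t value_;
};

// Hypothetical UTF-32 code unit with the analogous single check.
class utf32_code_unit {
public:
    // Surrogates (0xD800-0xDFFF) and values above 0x10FFFF are rejected.
    static constexpr bool is_valid(std::uint_least32_t v) noexcept {
        return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
    }

    // Throws std::invalid_argument for a surrogate or out-of-range value.
    constexpr explicit utf32_code_unit(std::uint_least32_t v)
        : value_(is_valid(v)
                     ? v
                     : throw std::invalid_argument("invalid UTF-32 code unit")) {}

    constexpr std::uint_least32_t value() const noexcept { return value_; }

private:
    std::uint_least32_t value_;
};
```

Code that receives a utf8_code_unit or utf32_code_unit can then rely on
the invariant instead of repeating the range checks at every call site.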

Different encodings can have different code point types. Unicode should
get a unicode_code_point type that wraps char32_t. Later we can add
other encodings such as ASCII, where ascii_code_point would wrap char.
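
A rough sketch of how an encoding could name its code point type. The
*_encoding tag structs and the code_point_t alias are hypothetical
names introduced only for this example, and validation is omitted:

```cpp
#include <type_traits>

// Hypothetical Unicode code point wrapping char32_t.
class unicode_code_point {
public:
    constexpr explicit unicode_code_point(char32_t v) noexcept : value_(v) {}
    constexpr char32_t value() const noexcept { return value_; }
private:
    char32_t value_;
};

// Hypothetical ASCII code point wrapping char.
class ascii_code_point {
public:
    constexpr explicit ascii_code_point(char v) noexcept : value_(v) {}
    constexpr char value() const noexcept { return value_; }
private:
    char value_;
};

// Illustrative encoding tags: each one names its own code point type.
struct utf8_encoding  { using code_point_type = unicode_code_point; };
struct utf32_encoding { using code_point_type = unicode_code_point; };
struct ascii_encoding { using code_point_type = ascii_code_point; };

// Convenience alias for looking up the code point type of an encoding.
template <class Encoding>
using code_point_t = typename Encoding::code_point_type;

// UTF-8 and UTF-32 share a code point type; ASCII has its own.
static_assert(std::is_same<code_point_t<utf8_encoding>,
                           code_point_t<utf32_encoding>>::value,
              "all Unicode encodings decode to unicode_code_point");
static_assert(!std::is_same<code_point_t<ascii_encoding>,
                            code_point_t<utf8_encoding>>::value,
              "ASCII code points are a distinct type");
```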

Regarding "codeunit" vs "code_unit": the Unicode Standard never says
"codeunit" and only uses "code unit". The same goes for "code point", so
the type names should have underscores.

Received on 2018-12-05 17:35:38