sg16: Re: [SG16-Unicode] Strong code unit types

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Wed, 5 Dec 2018 11:12:14 -0500

Dear SG16,

     I think a codepoint type would be very helpful, even if it is just a
strong typedef over char32_t that we manually define in the library. I am
not sure it would be a great idea to ask for another primitive type in
C++'s Core Language, since this one can be done fairly well in the library
with the appropriately operator-strapped strong typedef.

     With explicit constructors from `char32_t` we can probably realize
this dream fairly well, even if it might make code very verbose. (Making it
a regular, non-explicit constructor can probably aid ease of use for those
who already use Unicode and work with char32_t or uint32_t and friends.)
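
     To make the trade-off concrete, here is a minimal sketch of such a
strong type. The name `code_point` and its members are illustrative only
(nothing here is proposed wording); the sketch assumes just an explicit
constructor from `char32_t` and a few comparison operators.

```cpp
#include <type_traits>

// Hypothetical library type: a strong typedef over char32_t.
class code_point {
public:
    constexpr code_point() noexcept = default;

    // Explicit: a char32_t has to be converted deliberately.
    constexpr explicit code_point(char32_t v) noexcept : value_(v) {}

    // Access to the underlying scalar value.
    constexpr char32_t value() const noexcept { return value_; }

    // The "operator strapping": only operations that make sense for code points.
    friend constexpr bool operator==(code_point a, code_point b) noexcept {
        return a.value_ == b.value_;
    }
    friend constexpr bool operator!=(code_point a, code_point b) noexcept {
        return a.value_ != b.value_;
    }
    friend constexpr bool operator<(code_point a, code_point b) noexcept {
        return a.value_ < b.value_;
    }

private:
    char32_t value_ = 0;
};

// No implicit conversion from char32_t, but explicit construction works.
static_assert(!std::is_convertible<char32_t, code_point>::value,
              "code_point must not convert implicitly from char32_t");
static_assert(std::is_constructible<code_point, char32_t>::value,
              "code_point must be constructible from char32_t");
```

     With the constructor marked explicit, `code_point cp = U'A';` does not
compile while `code_point cp(U'A');` does, which is where the verbosity
comes from. An explicit constructor alone still accepts `code_point('A')`,
because `char` converts to `char32_t` in direct-initialization; rejecting
code unit types outright would need something more, such as deleted
overloads.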

     Distinct codeunit types are probably not worth the effort. Validation
is not something done on a per-code-unit basis to begin with; these are
multibyte sequences. Fundamentally, validation should work at the
multi-code-unit level: presenting anything else proliferates the confusion
that a single code unit by itself is meaningful. It is not meaningful.
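
     As an illustration of why validation belongs at the sequence level,
here is a rough sketch of a UTF-8 well-formedness check. The function name
`is_well_formed_utf8` is hypothetical; the byte ranges follow the
well-formed UTF-8 byte sequence table in the Unicode Standard.

```cpp
#include <cstddef>

// Hypothetical helper: true if [s, s + n) is well-formed UTF-8.
inline bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        std::size_t len = 0;
        // Allowed range for the second code unit; it depends on the lead.
        unsigned char lo = 0x80, hi = 0xBF;

        if (lead <= 0x7F) {
            len = 1;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) lo = 0xA0;  // excludes overlong forms
            if (lead == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) lo = 0x90;  // excludes overlong forms
            if (lead == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
        }

        if (n - i < len) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = s[i + k];
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (b < min || b > max) return false;
        }
        i += len;
    }
    return true;
}
```

     The byte 0x82 is rejected on its own but accepted as the second code
unit of E2 82 AC (U+20AC), so whether it is "valid" cannot be answered by
looking at that one code unit.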

     Furthermore, there are more encodings than the 3 we would have these
validated code units for. While first-class support for Unicode at such a
level would be good, individual code units are hardly worth validating:
sequences are what matter. This also leaves room for CESU-8, WTF-8, and
similar transformations, which may or may not encode things outside of the
typical range an individual code unit has but still make sense under
their own sequencing rules.

     Let's focus on sequences.

All the Best,
JeanHeyd

(P.S.: code_unit and code_point or codeunit and codepoint?)

On Wed, Dec 5, 2018 at 9:15 AM Tom Honermann <tom_at_[hidden]> wrote:

> On 12/5/18 8:05 AM, Steve Downey wrote:
>
> > `codepoint` also, which is probably "just" a char32_t?
>
> No, I think a type that isn't convertible from code unit types is
> desirable. (I'm interpreting your response as implying that 'codepoint'
> would just be a type alias of 'char32_t' as opposed to a distinct strong
> type)
>
> Thinking about the std::isalnum example we discussed this week. The
> problem was that it was being called with code unit values, but its
> parameter type means something more like a code point. Code like the
> following is well-formed and follows current recommendations for correct
> use of std::isalnum, but is nevertheless incorrect for multibyte
> encodings that reuse valid leading code unit values as trailing code unit
> values (e.g., Shift-JIS).
>
> #include <cctype>
>
> void f(const char *s) {
>     while (*s) {
>         // Cast to unsigned char, as recommended for the <cctype> functions.
>         if (std::isalnum(static_cast<unsigned char>(*s++))) {
>             ...
>         }
>     }
> }
>
> Use of a distinct type for code points that is not implicitly convertible
> from a code unit type prevents these kinds of problems.
>
> Tom.
>

Received on 2018-12-05 17:12:28