ISOCPP sg16 List: Re: char32_t as the scalar value type

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 28 Feb 2023 11:46:58 -0500

I'm content with use of char32_t as "the" Unicode code point and Unicode
scalar value type (I see no need to specify distinct types to
differentiate those). If someone feels the need to introduce a type
alias to differentiate the latter for documentation purposes, great.
Otherwise, I think concepts, contracts, and assertions suffice to opt in
to the constraints of the latter.
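
As a minimal sketch of the assertion route (the names is_code_point,
is_scalar_value, and utf8_length are hypothetical, chosen only for
illustration and not taken from any proposal), an interface can keep
char32_t as its parameter type and state the scalar value requirement
as a checked precondition:

```cpp
#include <cassert>

// Hypothetical helper: any Unicode code point, U+0000..U+10FFFF,
// surrogates included.
constexpr bool is_code_point(char32_t c) noexcept {
    return c <= 0x10FFFF;
}

// Hypothetical helper: a Unicode scalar value is a code point outside
// the surrogate range U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical function that takes plain char32_t but opts in to the
// scalar value constraint with an assertion.
constexpr unsigned utf8_length(char32_t c) {
    assert(is_scalar_value(c));  // precondition: c is a scalar value
    return c < 0x80 ? 1u : c < 0x800 ? 2u : c < 0x10000 ? 3u : 4u;
}
```

A function that accepts any code point would simply omit the assertion
(or assert is_code_point instead); the parameter type is the same in
both cases.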

I have two conditions on the above:

1. That we agree that char32_t is used as a *Unicode* code point and
    that we don't encourage its use with non-Unicode encodings (I would
    be much surprised to hear dissenting opinions on that).
2. That we leave room for a user-defined type to be used as a
    (non-Unicode) code point for other character sets such that it is
    possible to infer a character set for those types the same way the
    Unicode character set can be inferred for char32_t (e.g., via a
    trait). This ability is intended only to support generic text
    processing, not Unicode specific text processing (e.g., Unicode
    algorithms can directly require char32_t, not some generic code
    point concept). I think it would be useful to (notionally, if not
    explicitly via a concept) define what operations are applicable to,
    or required by, a code point type.

The use of char32_t as both a code unit type and a code point type does
imply that interfaces will not be able to distinguish between those
uses. I think that is ok; I haven't been able to think of a scenario in
which either a caller or callee would not know whether an operation is
intended to work on code units vs code points.

I think it is useful to think of char32_t as a "character" (not
"integer", not "code point") type that has an associated character set
and that holds a code point value (that is an "integer", not "character"
type). However, encoding those distinctions in the type system creates
friction and likely produces little if any value (at least in C++ given
its long history of implementing character types as integer types). I
took this approach with my text_view implementation and, in retrospect,
that was probably only useful to implement the "any" character type
(that has a dynamically associated character set); that support, if
useful in the first place, is likely better implemented via type erasure
anyway.

Tom.

On 2/25/23 7:28 PM, Steve Downey via SG16 wrote:
> I think I agree that using the char32_t type as the Unicode scalar
> value type that is the lingua franca for Unicode algorithms makes
> sense. The advantages of a `scalar_value` would be a place to hang the
> contract off of, to possibly provide a checked mode, and to signal
> that validation has been done.
>
> But, it would probably just encourage people to write a range cast,
> like views::transform([](char32_t c){return scalar_value{c};}) which
> would help no one.
>
> Writing a validation view for scalar values is fairly trivial.
>
> Unicode algorithms, however, must not over-trust their input then. We
> should not have UB in the std library. So that implies that any
> property lookups for int32s that aren't scalar values have to say NO
> and the algorithms should handle that. Many do naturally. But we
> should not repeat the mistakes of the original C character
> classification APIs.
>

Received on 2023-02-28 16:47:01