Re: char32_t as the scalar value type

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 28 Feb 2023 11:47:37 -0500
On 2/26/23 8:00 AM, Corentin Jabot via SG16 wrote:
> On Sun, Feb 26, 2023 at 1:28 AM Steve Downey via SG16
> <sg16_at_[hidden]> wrote:
> I think I agree that using the char32_t type as the Unicode scalar
> value type that is the lingua franca for Unicode algorithms makes
> sense. The advantages of a `scalar_value` would be a place to hang
> the contract off of, to possibly provide a checked mode, and to
> signal that validation has been done.
> But, it would probably just encourage people to write a range
> cast, like views::transform([](char32_t c){return
> scalar_value{c};}) which would help no one.
> Writing a validation view for scalar values is fairly trivial.
> Unicode algorithms, however, must then not over-trust their input.
> We should not have UB in the std library. So that implies that any
> property lookups for int32s that aren't scalar values have to say
> NO and the algorithms should handle that. Many do naturally. But
> we should not repeat the mistakes of the original C character
> classification APIs.
> As I say in my paper:
> Unicode algorithms are well-defined on any code points, including
> lone surrogates. Not producing lone surrogates is a post-condition of
> transcoding, not a precondition of algorithms.
> Surrogates have the general category Cs and in general have the
> defaulted value for other properties.
> They can be cased, normalized, clusterized, etc. (these transformations
> usually produce their identity).
> But it should be an error to encode, or decode, a scalar value, even
> if Unicode algorithms are well defined over them.

I think you meant a lone surrogate value here.

I agree for general text processing, but not for implementations of the
Unicode algorithms; those should just implement the algorithms as specified.
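
To make the distinction concrete, here is a minimal sketch of how the split might look in code. It is only an illustration: every name in it (is_surrogate, is_code_point, is_scalar_value, encode_utf8) is hypothetical and not taken from any paper or proposal. The classification functions are total over every char32_t value and never have undefined behavior, while the encoder is the one place where a lone surrogate is treated as an error.

```cpp
#include <optional>
#include <string>

// Hypothetical helper names, used for illustration only.

// True for surrogate code points U+D800..U+DFFF (general category Cs).
constexpr bool is_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDFFF;
}

// True for any Unicode code point, U+0000..U+10FFFF, surrogates included.
constexpr bool is_code_point(char32_t c) noexcept {
    return c <= 0x10FFFF;
}

// True for Unicode scalar values: code points that are not surrogates.
// Total over all char32_t values; out-of-range input simply answers "no".
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !is_surrogate(c);
}

// Encodes one scalar value as UTF-8. Lone surrogates and values above
// U+10FFFF are rejected here, at the encoding boundary, rather than
// being a precondition of the functions above.
std::optional<std::string> encode_utf8(char32_t c) {
    if (!is_scalar_value(c)) {
        return std::nullopt;
    }
    std::string out;
    if (c < 0x80) {
        out.push_back(static_cast<char>(c));
    } else if (c < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (c >> 12)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (c >> 18)));
        out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
    return out;
}
```

With this shape, is_surrogate(0xD800) just returns true and is_scalar_value(0x110000) just returns false, whereas encode_utf8(0xD800) yields an empty optional instead of producing ill-formed UTF-8.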



Received on 2023-02-28 16:47:40