Re: char32_t as the scalar value type

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 26 Feb 2023 14:00:02 +0100
On Sun, Feb 26, 2023 at 1:28 AM Steve Downey via SG16 <sg16_at_[hidden]>
wrote:

> I think I agree that using the char32_t type as the Unicode scalar value
> type that is the lingua franca for Unicode algorithms makes sense. The
> advantages of a `scalar_value` would be a place to hang the contract off
> of, to possibly provide a checked mode, and to signal that validation has
> been done.
>
> But, it would probably just encourage people to write a range cast, like
> views::transform([](char32_t c){return scalar_value{c};}) which would help
> no one.
>
> Writing a validation view for scalar values is fairly trivial.
>
> Unicode algorithms, however, must not over trust their input then. We
> should not have UB in the std library. So that implies that any property
> lookups for int32s that aren't scalar values have to say NO and the
> algorithms should handle that. Many do naturally. But we should not repeat
> the mistakes of the original C character classification APIs.
>

As I say in my paper:
  Unicode algorithms are well-defined on any code points, including lone
surrogates. Not producing lone surrogates is a post-condition of
transcoding, not a precondition of algorithms.

Surrogates have the general category Cs and, in general, the default
value for other properties.
They can be cased, normalized, clustered, etc. (these transformations
usually produce the identity).
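
To make that concrete, here is a minimal sketch. The helper names
(is_surrogate, is_scalar_value, toy_to_upper) are hypothetical and made up
for illustration; they are not from the paper or from any library, and the
case mapping is ASCII-only rather than real Unicode case mapping. The point
is only that a function can be total over char32_t and return its input
unchanged for a lone surrogate, with no precondition on the caller.

```cpp
// Hypothetical helpers, for illustration only.
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

// Toy ASCII-only uppercase mapping (an assumption, not real Unicode casing).
// It is defined for every char32_t and is the identity outside 'a'..'z',
// which includes surrogates and out-of-range values.
constexpr char32_t toy_to_upper(char32_t c) {
    return (c >= U'a' && c <= U'z')
               ? static_cast<char32_t>(c - (U'a' - U'A'))
               : c;
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800));
static_assert(toy_to_upper(U'a') == U'A');
static_assert(toy_to_upper(0xD800) == 0xD800);  // identity on a lone surrogate
```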

But it should be an error to encode or decode a lone surrogate, even if
Unicode algorithms are well-defined over them.
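
The encode side could be sketched roughly as follows. encode_utf8 is a
hypothetical name, and reporting the error through std::optional is just
one possible choice for the example, not a proposed interface. The only
thing it is meant to show is that surrogates and values above U+10FFFF are
rejected at the transcoding boundary.

```cpp
#include <optional>
#include <string>

// Hypothetical encoder, for illustration only: returns std::nullopt for
// anything that is not a Unicode scalar value.
inline std::optional<std::string> encode_utf8(char32_t c) {
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return std::nullopt;  // out of range or a surrogate: refuse to encode
    }
    std::string out;
    if (c < 0x80) {            // 1 byte
        out += static_cast<char>(c);
    } else if (c < 0x800) {    // 2 bytes
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {  // 3 bytes
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {                   // 4 bytes
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```

A decoder would make the mirror-image check and refuse any sequence that
would decode to a value in D800..DFFF.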


Received on 2023-02-26 13:00:16