On 2/26/23 8:00 AM, Corentin Jabot via SG16 wrote:


On Sun, Feb 26, 2023 at 1:28 AM Steve Downey via SG16 <sg16@lists.isocpp.org> wrote:
I think I agree that using the char32_t type as the Unicode scalar value type that is the lingua franca for Unicode algorithms makes sense. The advantages of a `scalar_value` would be a place to hang the contract off of, to possibly provide a checked mode, and to signal that validation has been done. 

But it would probably just encourage people to write a range cast, like `views::transform([](char32_t c){return scalar_value{c};})`, which would help no one. 

Writing a validation view for scalar values is fairly trivial. 

In that case, however, Unicode algorithms must not overtrust their input. We should not have UB in the std library. That implies that any property lookup on a 32-bit value that isn't a scalar value has to say NO, and the algorithms should handle that. Many do naturally. But we should not repeat the mistakes of the original C character classification APIs, where passing an out-of-range value is undefined behavior. 
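
A rough sketch of what "say NO" could look like for a lookup function. The property here is a toy stand-in (ASCII letters only) and the name `toy_is_alphabetic` is made up for illustration; a real implementation would consult Unicode Character Database tables. The point is only that the function is total over every `char32_t` value, in contrast to `isalpha`, whose behavior is undefined for arguments that are neither `EOF` nor representable as `unsigned char`.

```cpp
// Hypothetical, toy property lookup that is defined for every char32_t.
constexpr bool toy_is_alphabetic(char32_t c) noexcept {
    // Not a code point at all: answer "no" before touching any table.
    if (c > 0x10FFFF) return false;
    // Surrogates: answer "no" here (assumed default for this toy property).
    if (c >= 0xD800 && c <= 0xDFFF) return false;
    // Stand-in for a real table lookup.
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
}
```

With lookups shaped like this, an algorithm built on top of them gets a well-defined answer for any input and needs no precondition on the caller.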

As I say in my paper:
  Unicode algorithms are well-defined on any code points, including lone surrogates. Not producing lone surrogates is a post-condition of transcoding, not a precondition of algorithms.

Surrogates have the general category Cs and, in general, have the default value for other properties.
They can be cased, normalized, clusterized, etc. (these transformations usually produce the identity).

But it should be an error to encode, or decode, a scalar value, even if Unicode algorithms are well defined over them.

I think you meant a lone surrogate value here.

I agree for general text processing, but not for implementations of the Unicode algorithms; those should just implement the algorithms as specified.

Tom.