Date: Tue, 28 Feb 2023 17:59:24 +0100
On Tue, Feb 28, 2023 at 5:47 PM Tom Honermann <tom_at_[hidden]> wrote:
> On 2/26/23 8:00 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Sun, Feb 26, 2023 at 1:28 AM Steve Downey via SG16 <
> sg16_at_[hidden]> wrote:
>
>> I think I agree that using the char32_t type as the Unicode scalar value
>> type that is the lingua franca for Unicode algorithms makes sense. The
>> advantages of a `scalar_value` would be a place to hang the contract off
>> of, to possibly provide a checked mode, and to signal that validation has
>> been done.
>>
>> But, it would probably just encourage people to write a range cast, like
>> views::transform([](char32_t c){return scalar_value{c};}) which would help
>> no one.
>>
>> Writing a validation view for scalar values is fairly trivial.
>>
>> Unicode algorithms, however, must not overtrust their input then. We
>> should not have UB in the std library. So that implies that any property
>> lookups for int32s that aren't scalar values have to say NO and the
>> algorithms should handle that. Many do naturally. But we should not repeat
>> the mistakes of the original C character classification APIs.
>>
>
> As I say in my paper:
> Unicode algorithms are well-defined on any code points, including lone
> surrogates. Not producing lone surrogates is a post-condition of
> transcoding, not a precondition of algorithms.
>
> Surrogates have the general category Cs and in general have the defaulted
> value for other properties.
> They can be cased, normalized, clusterized, etc. (these transformations
> usually produce the identity)
>
> But it should be an error to encode, or decode, a scalar value, even if
> Unicode algorithms are well defined over them
>
> I think you meant a lone surrogate value here.
>
Yes. *facepalm*
> I agree for general text processing, but not for implementations of the
> Unicode algorithms; those should just implement the algorithms as specified.
>
I'm not sure what you mean here.
> Tom.
>
>
>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
>
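
For reference, the "fairly trivial" scalar value check mentioned in Steve's message could look roughly like the sketch below. The names (is_code_point, is_surrogate, is_scalar_value, sanitize) are hypothetical and do not come from any paper or from the standard library. The only facts assumed are the Unicode ranges themselves: code points are 0..0x10FFFF, surrogates are 0xD800..0xDFFF, and U+FFFD is the replacement character.

```cpp
// Hypothetical helpers for illustration only; not a proposed API.

// A code point is any value in [0, 0x10FFFF].
constexpr bool is_code_point(char32_t c) noexcept {
    return c <= 0x10FFFF;
}

// Surrogate code points occupy [0xD800, 0xDFFF] (general category Cs).
constexpr bool is_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is a code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !is_surrogate(c);
}

// One possible policy at a transcoding boundary: substitute U+FFFD
// for anything that is not a scalar value. Other policies (reporting
// an error, dropping the element) are equally plausible.
constexpr char32_t sanitize(char32_t c) noexcept {
    return is_scalar_value(c) ? c : char32_t{0xFFFD};
}
```

Because every function is total over char32_t, none of them has a precondition to violate. In C++20 such a predicate could presumably be composed with std::views::filter or std::views::transform to get a validating view over a range of char32_t.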
Received on 2023-02-28 16:59:38