Re: char32_t as the scalar value type

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 28 Feb 2023 12:04:39 -0500
On 2/28/23 11:59 AM, Corentin Jabot via SG16 wrote:
>
>
> On Tue, Feb 28, 2023 at 5:47 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 2/26/23 8:00 AM, Corentin Jabot via SG16 wrote:
>>
>>
>> On Sun, Feb 26, 2023 at 1:28 AM Steve Downey via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> I think I agree that using the char32_t type as the Unicode
>> scalar value type that is the lingua franca for Unicode
>> algorithms makes sense. The advantages of a `scalar_value`
>> would be a place to hang the contract off of, to possibly
>> provide a checked mode, and to signal that validation has
>> been done.
>>
>> But, it would probably just encourage people to write a range
>> cast, like views::transform([](char32_t c){return
>> scalar_value{c};}) which would help no one.
>>
>> Writing a validation view for scalar values is fairly trivial.
>>
>> Unicode algorithms, however, must not overtrust their input,
>> then. We should not have UB in the std library. So that
>> implies that any property lookups for int32s that aren't
>> scalar values have to say NO, and the algorithms should handle
>> that. Many do naturally. But we should not repeat the
>> mistakes of the original C character classification APIs.
>>
>>
>> As I say in my paper:
>> Unicode algorithms are well-defined on any code points,
>> including lone surrogates. Not producing lone surrogates is a
>> post-condition of transcoding, not a precondition of algorithms.
>>
>> Surrogates have the general category Cs and in general have the
>> defaulted value for other properties.
>> They can be cased, normalized, clusterized, etc. (these
>> transformations usually produce the identity).
>>
>> But it should be an error to encode, or decode, a scalar value,
>> even if Unicode algorithms are well-defined over them.
>
> I think you meant a lone surrogate value here.
>
>
> Yes. *facepalm*
>
> I agree for general text processing, but not for implementations
> of the Unicode algorithms; those should just implement the
> algorithms as specified.
>
>
> I'm not sure what you mean here

I meant that an implementation of a Unicode algorithm should not produce
an error when it encounters a lone surrogate, provided the algorithm is
well-defined for such code points.
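
To make the distinction concrete, here is a minimal sketch in C++. It is
only an illustration: every name in it (is_scalar_value,
all_scalar_values, to_lower_sketch, encode_utf8) is hypothetical and does
not come from any paper or proposal discussed in this thread, and the
ASCII-only case mapping is a stand-in for a real Unicode algorithm. The
algorithm-style function is total over all char32_t values and returns
the identity for a lone surrogate, while the encoder treats a
non-scalar value as an error.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helpers for illustration only.

// Surrogate code points occupy [0xD800, 0xDFFF].
constexpr bool is_surrogate(char32_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is a code point (<= 0x10FFFF) that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !is_surrogate(c);
}

// Validation over a whole sequence: true only if every element is a
// scalar value.
inline bool all_scalar_values(std::u32string_view text) {
    return std::all_of(text.begin(), text.end(), is_scalar_value);
}

// Stand-in for a Unicode algorithm (assumption: ASCII-only lowercasing).
// It is defined for every input and never reports an error; anything it
// does not know about, including lone surrogates, maps to itself.
constexpr char32_t to_lower_sketch(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 0x20) : c;
}

// Encoding boundary: a surrogate or out-of-range value is an error here,
// reported as an empty optional.
inline std::optional<std::string> encode_utf8(char32_t c) {
    if (!is_scalar_value(c)) {
        return std::nullopt;
    }
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```

With these definitions, to_lower_sketch(0xD800) simply returns 0xD800,
whereas encode_utf8(0xD800) yields no value. Returning std::optional is
just one way to signal the error; a real interface might prefer a
replacement character or an error code.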

Tom.

> Tom.
>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>

Received on 2023-02-28 17:04:40