Re: Core formal unicode classifications functions

From: Steve Downey <sdowney_at_[hidden]>
Date: Sun, 3 Sep 2023 20:50:17 -0400
On Sun, Sep 3, 2023 at 3:43 AM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> > is_unicode_nonchar
> better stick with character, char as an abbreviation only adds confusion
>
Tentatively agree

> > is_unicode_char
> while non characters are specified, characters are not. This doesn't do
> much, and should be removed,
> unless we want is_scalar_value instead (which is ever so
> slightly different from is_unicode_nonchar)
>
This is also spelled isLegal in Java and in other APIs. I don't like that
term. I'd welcome a better one.
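
To make the distinction concrete, here is a minimal sketch of the two predicates being compared. The names is_scalar_value and is_noncharacter are illustrative only and are not taken from the header under discussion; the ranges are the ones the Unicode Standard defines (surrogates are U+D800..U+DFFF; noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane).

```cpp
// Sketch only: illustrative names, not the proposed API.

// A Unicode scalar value is any code point except the surrogates.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

// Noncharacters: U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in each plane.
constexpr bool is_noncharacter(char32_t c) noexcept {
    return c <= 0x10FFFF &&
           ((c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE);
}

// U+FFFF is a scalar value and also a noncharacter, so the two
// predicates are not complements of each other.
static_assert(is_scalar_value(0xFFFF) && is_noncharacter(0xFFFF), "");
```

A "character" test in the ICU sense would then be a scalar value that is not a noncharacter, which is why the two notions differ only slightly.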


> > is_lead/is_trail
>
> Unicode usually goes with high/low surrogate, we might as well stick with
> standard names
>
ICU uses lead/trail, but I'm fine with high/low, although I always have to
check whether high comes before low in the stream.
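
As a reminder of the ordering, a small sketch (names are illustrative, not the header's): the high surrogate, U+D800..U+DBFF, is the lead unit and comes first in the stream; the low surrogate, U+DC00..U+DFFF, is the trail unit and comes second.

```cpp
// Sketch only: illustrative names, not the proposed API.

// High (lead) surrogates occupy U+D800..U+DBFF.
constexpr bool is_high_surrogate(char16_t u) noexcept {
    return (u & 0xFC00) == 0xD800;
}

// Low (trail) surrogates occupy U+DC00..U+DFFF.
constexpr bool is_low_surrogate(char16_t u) noexcept {
    return (u & 0xFC00) == 0xDC00;
}

// U+1F600 encodes as the pair D83D DE00: high first, then low.
constexpr char16_t grinning_face[] = u"\U0001F600";
static_assert(is_high_surrogate(grinning_face[0]), "");
static_assert(is_low_surrogate(grinning_face[1]), "");
```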


> > is_single
>
> Seems redundant and the name is confusing
>
> > is_lead(char16_t codeunit)
>
> What is the motivation for having both char32_t and char16_t overloads?
>
Ancient architectures, apparently.


> > surrogate_offset
>
> This is too much of an implementation detail for my taste.
>
People end up using them, for good or ill. But it should probably be a
struct of specialized constants.


> > get_supplementary
>
> We probably want preconditions on that, instead of making it unspecified.
> Do we need a better name?
>
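
One possible shape for this, as a sketch: the offset lives in a struct of constants and the combining function states its preconditions with assert. The struct name utf16_constants is hypothetical; the arithmetic follows the ICU macros U16_SURROGATE_OFFSET and U16_GET_SUPPLEMENTARY.

```cpp
#include <cassert>

// Sketch only: utf16_constants is a hypothetical name.
struct utf16_constants {
    // (0xD800 << 10) + 0xDC00 - 0x10000, as in ICU's U16_SURROGATE_OFFSET.
    static constexpr char32_t surrogate_offset =
        (char32_t{0xD800} << 10) + 0xDC00 - 0x10000;
};

// Preconditions: high is in U+D800..U+DBFF and low is in U+DC00..U+DFFF.
constexpr char32_t get_supplementary(char16_t high, char16_t low) noexcept {
    assert((high & 0xFC00) == 0xD800);
    assert((low & 0xFC00) == 0xDC00);
    return (char32_t{high} << 10) + low - utf16_constants::surrogate_offset;
}

static_assert(get_supplementary(0xD83D, 0xDE00) == 0x1F600, "");
```

With the precondition stated, a violation is a caller bug that a checked build can diagnose, rather than a deterministic but meaningless result.
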
> > is_utf8_lead
> > is_utf8_trail
>
> What are the use cases for that?
>
>
Parsing utf-8 streams
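
For example, finding the start of the code point that contains an arbitrary byte offset. This is a sketch under assumptions: sync_backward is a hypothetical helper, unsigned char stands in for char8_t so the code does not need C++20, and the lead range 0xC2..0xF4 is the set of bytes that can begin a well-formed multi-byte sequence.

```cpp
#include <cstddef>

// Sketch only: illustrative names, not the proposed API.

// Trail (continuation) bytes have the bit pattern 10xxxxxx.
constexpr bool is_utf8_trail(unsigned char b) noexcept {
    return (b & 0xC0) == 0x80;
}

// Bytes that can start a well-formed multi-byte sequence.
constexpr bool is_utf8_lead(unsigned char b) noexcept {
    return b >= 0xC2 && b <= 0xF4;
}

// Back up from index i over at most three trail bytes. On well-formed
// input this yields the index where the enclosing code point starts.
constexpr std::size_t sync_backward(const unsigned char* s,
                                    std::size_t i) noexcept {
    std::size_t steps = 0;
    while (i > 0 && steps < 3 && is_utf8_trail(s[i])) {
        --i;
        ++steps;
    }
    return i;
}

// 'A', U+20AC EURO SIGN (E2 82 AC), 'B'
constexpr unsigned char sample[] = {0x41, 0xE2, 0x82, 0xAC, 0x42};
static_assert(sync_backward(sample, 3) == 1, "");
static_assert(is_utf8_lead(sample[1]), "");
```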


> > count_utf8_trail_bytes
> > count_utf8_trail_bytes_unsafe
>
> llvm has a single getNumBytesForUTF8(char8_t) function that returns the
> total number of bytes, including the leading one.
> In terms of having safe and unsafe variants, it's not really needed, but
> we should have a precondition.
> The implementation can compute a result without comparisons (llvm uses a
> precomputed table for example)
>
Looks like in llvm that's not the primitive, either. Which makes sense, you
want to know how many _more_ bytes you need to read.
https://llvm.org/doxygen/ConvertUTF_8cpp_source.html#l00545

unsigned getNumBytesForUTF8(UTF8 first) {
  return trailingBytesForUTF8[first] + 1;
}
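
The same layering can be sketched without a table: the trail-byte count is the primitive, and the total length is that count plus one. This is an illustration only, with hypothetical names and unsigned char standing in for char8_t; unlike a table lookup it does use comparisons. Non-lead bytes (ASCII, trail bytes, and bytes that cannot start a well-formed sequence) all report zero trail bytes, so callers still need to validate separately.

```cpp
// Sketch only: illustrative names, not the proposed API.

// Bytes that can start a well-formed multi-byte sequence.
constexpr bool is_utf8_lead(unsigned char b) noexcept {
    return b >= 0xC2 && b <= 0xF4;
}

// How many more bytes follow this one; 0 for anything that is not a lead.
constexpr int count_utf8_trail_bytes(unsigned char b) noexcept {
    return is_utf8_lead(b) ? 1 + (b >= 0xE0) + (b >= 0xF0) : 0;
}

// Total sequence length, derived from the primitive above.
constexpr int num_bytes_for_utf8(unsigned char first) noexcept {
    return count_utf8_trail_bytes(first) + 1;
}

static_assert(count_utf8_trail_bytes(0xE2) == 2, "");
static_assert(num_bytes_for_utf8(0xE2) == 3, "");
```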



> On Sun, Sep 3, 2023 at 5:06 AM Steve Downey via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Paper to follow:
>>
>> https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h
>>
>> Based on the ICU C macros, but once types are added, some of them are not
>> very useful, or plain harmful.
>>
>> These are all classifications and counts, no decode or encode, or
>> validity checks above code unit.
>>
>> I haven't added the various named block functions, as I don't really see
>> a particular point to them outside of short-circuit real unicode database
>> functions.
>>
>> They're all constexpr inline noexcept. Some may have 'unspecified'
>> results where that means the value is deterministic but meaningless. There
>> is no undefined behavior.
>>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2023-09-04 00:50:29