On Sun, Sep 3, 2023 at 3:43 AM Corentin Jabot <corentinjabot@gmail.com> wrote:
> is_unicode_nonchar
Better to stick with "character"; "char" as an abbreviation only adds confusion.

Tentatively agree 
> is_unicode_char
While noncharacters are specified, characters are not. This doesn't do much and should be removed,
unless we want is_scalar_value instead (which is ever so slightly different from is_unicode_nonchar).

This is also spelled isLegal in Java and in other APIs. I don't like that term. Welcome a better one. 
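
To make the difference concrete, here is a minimal sketch of the two predicates. The names is_noncharacter and is_scalar_value are illustrative assumptions, not the spellings in unicode_formal.h. The ranges are the ones Unicode defines: 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane), and scalar values as every code point except the surrogates.

```cpp
// Hypothetical names, for illustration only.
constexpr bool is_noncharacter(char32_t c) noexcept {
    // U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return c <= 0x10FFFF &&
           ((c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE);
}

constexpr bool is_scalar_value(char32_t c) noexcept {
    // U+0000..U+10FFFF, excluding the surrogate range U+D800..U+DFFF.
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}
```

The two are not complements: U+FFFE is both a noncharacter and a scalar value, while U+D800 is neither.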
 
> is_lead/is_trail

Unicode usually goes with high/low surrogate, we might as well stick with standard names

ICU uses lead/trail, but I'm fine with high/low, although I always have to check whether high comes before low in the stream.
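
For reference, a small sketch of the two predicates under the high/low spelling. The function names are assumptions for illustration. In well-formed UTF-16 the high surrogate (U+D800..U+DBFF) comes first in the stream and the low surrogate (U+DC00..U+DFFF) follows it, so "high" corresponds to lead and "low" to trail.

```cpp
// Hypothetical names, for illustration only.
constexpr bool is_high_surrogate(char16_t u) noexcept {
    // 0xD800..0xDBFF: first code unit of a surrogate pair (the "lead").
    return (u & 0xFC00) == 0xD800;
}

constexpr bool is_low_surrogate(char16_t u) noexcept {
    // 0xDC00..0xDFFF: second code unit of a surrogate pair (the "trail").
    return (u & 0xFC00) == 0xDC00;
}
```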
 
> is_single

Seems redundant and the name is confusing

> is_lead(char16_t codeunit)

What is the motivation for having both char32_t and char16_t overloads?

Ancient architectures, apparently. 
 
> surrogate_offset 

This is too much of an implementation detail for my taste. 

People end up using them, for good or ill. But it should probably be a struct of specialized constants. 
 
> get_supplementary

We probably want preconditions on that, instead of making it unspecified. Do we need a better name?
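
A sketch of what a precondition-checked version could look like, together with the "struct of constants" idea for surrogate_offset. The struct, its member names, and the use of assert for the preconditions are all assumptions for illustration; the folded-offset arithmetic follows the general approach of the ICU macros.

```cpp
#include <cassert>

// Hypothetical grouping of the UTF-16 constants; names are illustrative only.
struct utf16_constants {
    static constexpr char32_t high_first = 0xD800;
    static constexpr char32_t low_first = 0xDC00;
    static constexpr char32_t supplementary_first = 0x10000;
    // One folded constant, so decoding is a shift, an add, and a subtract.
    static constexpr char32_t surrogate_offset =
        (high_first << 10) + low_first - supplementary_first;
};

constexpr char32_t get_supplementary(char16_t high, char16_t low) noexcept {
    // Preconditions, checked by assert unless NDEBUG is defined.
    assert((high & 0xFC00) == 0xD800);  // high must be in 0xD800..0xDBFF
    assert((low & 0xFC00) == 0xDC00);   // low must be in 0xDC00..0xDFFF
    return (static_cast<char32_t>(high) << 10) + low -
           utf16_constants::surrogate_offset;
}
```

With this shape a violated precondition fails loudly in debug builds and fails to compile in a constant expression, instead of yielding an unspecified value.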

> is_utf8_lead
> is_utf8_trail

What are the use cases for that?


Parsing UTF-8 streams.
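
As one concrete case, a sketch of resynchronizing inside a UTF-8 buffer: given an arbitrary byte index, step back to the start of the code unit sequence that contains it. The function names and the helper code_point_start are assumptions for illustration, and unsigned char stands in for char8_t so the sketch also compiles before C++20.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
constexpr bool is_utf8_trail(unsigned char b) noexcept {
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80;
}

constexpr bool is_utf8_lead(unsigned char b) noexcept {
    // Lead bytes of well-formed multi-byte sequences are 0xC2..0xF4.
    return b >= 0xC2 && b <= 0xF4;
}

// Step back from index i over at most three trail bytes.
constexpr std::size_t code_point_start(const unsigned char* s,
                                       std::size_t i) noexcept {
    std::size_t steps = 0;
    while (i > 0 && steps < 3 && is_utf8_trail(s[i])) {
        --i;
        ++steps;
    }
    return i;
}
```

The sketch only classifies code units; it does not check that the byte it lands on is a valid lead for the trail bytes that follow.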
 
> count_utf8_trail_bytes
> count_utf8_trail_bytes_unsafe

LLVM has a single getNumBytesForUTF8(char8_t) function that returns the total number of bytes, including the leading one.
Having both safe and unsafe variants isn't really needed, but we should have a precondition.
The implementation can compute a result without comparisons (LLVM uses a precomputed table, for example).
Looks like that's not the primitive in LLVM, either. Which makes sense: you want to know how many _more_ bytes you need to read. 
https://llvm.org/doxygen/ConvertUTF_8cpp_source.html#l00545 

unsigned getNumBytesForUTF8(UTF8 first) {
  return trailingBytesForUTF8[first] + 1;
}
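
For comparison, a sketch of a table-free version with the trail count as the primitive and the total as a thin wrapper. The names and the stated precondition are assumptions for illustration; unsigned char stands in for char8_t so this also compiles before C++20. It does use comparisons, but sums their results rather than branching on them.

```cpp
// Hypothetical names, for illustration only.
// Assumed precondition: lead is ASCII (0x00..0x7F) or a lead byte (0xC2..0xF4).
constexpr int count_utf8_trail_bytes(unsigned char lead) noexcept {
    // Each comparison contributes 0 or 1; no table and no branch is needed.
    return (lead >= 0xC2) + (lead >= 0xE0) + (lead >= 0xF0);
}

// Total length of the sequence, including the lead byte itself.
constexpr int utf8_sequence_length(unsigned char lead) noexcept {
    return count_utf8_trail_bytes(lead) + 1;
}
```

Outside the precondition (for example a stray trail byte such as 0x80) the result is deterministic but not meaningful, which is the "unspecified" behaviour the header describes.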



On Sun, Sep 3, 2023 at 5:06 AM Steve Downey via SG16 <sg16@lists.isocpp.org> wrote:
Paper to follow:
https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h

Based on the ICU C macros, but once types are added, some of them are not very useful, or plain harmful. 

These are all classifications and counts, no decode or encode, or validity checks above code unit. 

I haven't added the various named block functions, as I don't really see a particular point to them other than short-circuiting real Unicode database functions. 

They're all constexpr inline noexcept. Some may have 'unspecified' results where that means the value is deterministic but meaningless. There is no undefined behavior. 


--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16