Re: Core formal unicode classifications functions

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 3 Sep 2023 09:43:45 +0200
> is_unicode_nonchar
Better to stick with "character"; "char" as an abbreviation only adds
confusion.

> is_unicode_char
While noncharacters are specified, characters are not. This doesn't do
much and should be removed,
unless we want is_scalar_value instead (which is ever so
slightly different from is_unicode_nonchar).
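
To make the distinction concrete, here is a rough sketch of the two
predicates. The names is_scalar_value and is_noncharacter are placeholders
of mine, not names from the proposed header; the ranges are the ones the
Unicode Standard gives for scalar values and noncharacters.

```cpp
// Hypothetical names, for illustration only; not taken from the header
// under discussion.

// A scalar value is any code point that is not a surrogate:
// U+0000..U+D7FF and U+E000..U+10FFFF.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return c < 0xD800 || (c >= 0xE000 && c <= 0x10FFFF);
}

// Noncharacters are U+FDD0..U+FDEF plus the last two code points of each
// of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
constexpr bool is_noncharacter(char32_t c) noexcept {
    return (c >= 0xFDD0 && c <= 0xFDEF) ||
           (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE);
}
```

Note that U+FFFF is a noncharacter and still a scalar value, while U+D800
is neither, so one predicate is not simply the negation of the other.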

> is_lead/is_trail

Unicode usually uses high/low surrogate; we might as well stick with the
standard names.
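
With the standard terminology, the pair might look roughly like this. The
function names are my own guess at a spelling, not something from the
header; the ranges are the Unicode high-surrogate and low-surrogate ranges.

```cpp
// Hypothetical spellings using the Unicode terms instead of lead/trail.

// High surrogates: U+D800..U+DBFF (first code unit of a UTF-16 pair).
constexpr bool is_high_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDBFF;
}

// Low surrogates: U+DC00..U+DFFF (second code unit of a UTF-16 pair).
constexpr bool is_low_surrogate(char32_t c) noexcept {
    return c >= 0xDC00 && c <= 0xDFFF;
}
```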

> is_single

Seems redundant and the name is confusing

> is_lead(char16_t codeunit)

What is the motivation for having both char32_t and char16_t overloads?

> surrogate_offset

This is too much of an implementation detail for my taste.

> get_supplementary

We probably want preconditions on that, instead of making it unspecified.
Do we need a better name?
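
As a sketch of what I mean by preconditions, assuming C++14 or later: the
name decode_surrogate_pair is only a suggestion of mine, and assert stands
in for whatever precondition mechanism the paper ends up using.

```cpp
#include <cassert>

// Hypothetical name and signature, not the one in the header.
// Preconditions: high is in U+D800..U+DBFF, low is in U+DC00..U+DFFF.
constexpr char32_t decode_surrogate_pair(char16_t high, char16_t low) noexcept {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the pair 0xD83D 0xDE00 gives U+1F600. A violated precondition
fails loudly in a debug build rather than producing a deterministic but
meaningless value.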

> is_utf8_lead
> is_utf8_trail

What are the use cases for that?


> count_utf8_trail_bytes
> count_utf8_trail_bytes_unsafe

LLVM has a single getNumBytesForUTF8(char8_t) function that returns the
total number of bytes, including the leading one.
In terms of having safe and unsafe variants, it's not really needed, but we
should have a precondition.
The implementation can compute a result without comparisons (LLVM uses a
precomputed table, for example).
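
A comparison-free version could look something like the sketch below. This
is not LLVM's code: the table layout, the names, and the choice to return 0
for a byte that cannot start a sequence are all assumptions of mine. I use
unsigned char so it compiles before C++20; char8_t would be the natural
parameter type.

```cpp
// Hypothetical table indexed by the top five bits of the first code unit.
// Entries give the total sequence length, including the leading byte;
// 0 marks a byte that cannot start a sequence.
constexpr unsigned char utf8_length_table[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00..0x7F: ASCII
    0, 0, 0, 0, 0, 0, 0, 0,                         // 0x80..0xBF: continuation
    2, 2, 2, 2,                                     // 0xC0..0xDF: two bytes
    3, 3,                                           // 0xE0..0xEF: three bytes
    4,                                              // 0xF0..0xF7: four bytes
    0                                               // 0xF8..0xFF: never valid
};

// Hypothetical name. A single shift and table load, no comparisons.
constexpr unsigned utf8_sequence_length(unsigned char first) noexcept {
    return utf8_length_table[first >> 3];
}
```

The table only looks at five bits, so it still reports 2 for 0xC0 and 0xC1
and 4 for 0xF5..0xF7, none of which appear in well-formed UTF-8. That is
the sort of thing the precondition would have to rule out.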


On Sun, Sep 3, 2023 at 5:06 AM Steve Downey via SG16 <sg16_at_[hidden]>
wrote:

> Paper to follow:
>
> https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h
>
> Based on the ICU C macros, but once types are added, some of them are not
> very useful, or plain harmful.
>
> These are all classifications and counts, no decode or encode, or validity
> checks above code unit.
>
> I haven't added the various named block functions, as I don't really see a
> particular point to them outside of short-circuit real unicode database
> functions.
>
> They're all constexpr inline noexcept. Some may have 'unspecified' results
> where that means the value is deterministic but meaningless. There is no
> undefined behavior.
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2023-09-03 07:43:59