C++ Logo

sg16

Advanced search

Re: Core formal unicode classifications functions

From: Steve Downey <sdowney_at_[hidden]>
Date: Sun, 3 Sep 2023 19:47:39 -0400
On Sun, Sep 3, 2023 at 2:18 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:

>
>
> On 03/09/2023 05.06, Steve Downey via SG16 wrote:
> > Paper to follow:
> >
> https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h
> <
> https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h
> >
> >
> > Based on the ICU C macros, but once types are added, some of them are
> not very useful, or plain harmful.
> >
> > These are all classifications and counts, no decode or encode, or
> validity checks above code unit.
> >
> > I haven't added the various named block functions, as I don't really see
> a particular point to them outside of short-circuit real unicode database
> functions.
> >
> > They're all constexpr inline noexcept. Some may have 'unspecified'
> results where that means the value is deterministic but meaningless. There
> is no undefined behavior.
>
> Drop the "inline"; it is an implementation detail.
>
> Check. Thinko when pulling out the synopsis / forward declaration.

> Where do the names come from? Does Unicode define them, somewhere?
>

ICU4C has used these forever, and they've leaked out into other APIs.

If not, and those are your own inventions,
> is_lead, is_trail -> add _surrogate
> and maybe "is_leading_surrogate"
>
> Current language seems to be high/low although trail / lead specifies the
in-memory order.

>
> "Code point offset for surrogate pair calculation"
>
> I don't know what that means. Apparently, this function
> returns a constant. (It has no parameters.) Maybe a
> constexpr variable would be better? Same for utf16_max_length.
>
> Why "_formal" in the namespace name? Drop it.
>

Will for standardization. Have we agreed on std::unicode with a handshake
agreement that `uc` is the accepted alias?


> surrogate_lead/trail:
> "The result is unspecified if the code point is not assignable."
>
> What does "assignable" mean? Can I somehow test for that before
> calling the function?
>
Yes, it's a code point that is in the space of allocatable values, which
excludes a few things between 0 and 10FFFF. It's testable here by
`is_unicode_char`.

> Shouldn't the function indicate an error when that happens?
> Would it be more useful to have a
>
> std::tuple<lead, trail, error> split_supplementary()
>
> function, usable with structured bindings?
>
> Noted, and is sensible.

> If the results are "unspecified", we can't put contracts on those
> functions to check their preconditions. That feels like the wrong
> approach.
>
> What's the current state of contracts on noexcept functions? I'd hate to
ruin someone's tight loop that is invoking ((((supplementary) >> 10) +
0xd7c0)) that ought not to throw, it just encourages avoiding the standard
library.

>
> "leadByte"
>
> The standard doesn't do camelCase.
>
>
> Check

> "Does not branch."
>
> You can't make such a statement. You have no idea what the
> implementation will do.
>
> I suppose I can replace that with some ridiculous performance constraint,
but the reason for its existence is that you can implement it as three
tests and addition of the results that pipelines well and is cache
friendly.


> There should be a function that checks "codepoint <= 0x10ffff".
>
>
>
`is_unicode_char` turns out to be what you want in practice.


> is_supplementary
>
> The static_cast should be around "codepoint", I guess.
>

I think we want to lift to unsigned comparison after the base shift. I'll
spend time checking this though.

>
> is_lead/trail:
>
> The parens around "codeunit" can go. Also, add spaces.
> And 0xfffffc00 makes no sense for a char16_t.
> Also for surrogate_trail.
>
> It's really looking like these are pessimizations on any 32 bit
architecture, so everything in the last 20 years or so of real interest.
Probably best to let the char16 ones be subsumed by char32, and I'm making
a note to delete some old code at work.


> surrogate_offset seems to be very much an implementation
> detail of get_supplementary. Drop the former.
>
>
> I think I probably should have a struct, or free constants, with the
relevant magic numbers for people re-implementing or hand fusing the
algorithms. There seems to be a lot in code I can scan at work. No idea if
they really should be, but they do.

count_utf8_trail_bytes_unsafe
>
> This looks like it allows for arbitrary "int" arguments in
> its implementation, and that's the rationale for _unsafe.
> Is there any 8-bit value >= 0xc2 that is NOT a lead byte?
> If not, remove the _unsafe interface.
>
> Yes, there are a lot, especially once the UC consortium restricted the
encoding to at most 4 bytes, rather than 6.


> utf8_length
>
> Replace the ladder of conditional expressions with regular
> if ... else
>
> All compilers today do the right thing, so I agree.

>
> Jens
>

Thank you very much for the review!

Received on 2023-09-03 23:47:52