C++ Logo

sg16

Advanced search

Re: Core formal unicode classifications functions

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Mon, 4 Sep 2023 09:10:15 +0200
On 04/09/2023 01.47, Steve Downey wrote:
> On Sun, Sep 3, 2023 at 2:18 AM Jens Maurer <jens.maurer_at_[hidden] <mailto:jens.maurer_at_[hidden]>> wrote:
> Where do the names come from? Does Unicode define them, somewhere?
>
>
> ICU4C has used these forever, and they've leaked out into other APIs.
>
> If not, and those are your own inventions,
> is_lead, is_trail -> add _surrogate
> and maybe "is_leading_surrogate"
>
> Current language seems to be high/low although trail / lead specifies the in-memory order.

The first level of concern for me is to have "surrogate" in the names.

I agree there is a second level of concern: whether to call them lead/trail
or low/high. An eventual paper should contain copious discussion,
so that the committee can make an informed decision.

> "Code point offset for surrogate pair calculation"
>
> I don't know what that means. Apparently, this function
> returns a constant. (It has no parameters.) Maybe a
> constexpr variable would be better? Same for utf16_max_length.
>
> Why "_formal" in the namespace name? Drop it.
>
>
> Will for standardization. Have we agreed on std::unicode with a handshake agreement that `uc` is the accepted alias?

I don't think we've agreed on anything here, and I don't understand
the latter part of your sentence. If you want a std::uc namespace
or alias, put that into your paper.

> surrogate_lead/trail:
> "The result is unspecified if the code point is not assignable."
>
> What does "assignable" mean? Can I somehow test for that before
> calling the function?
>
> Yes, it's a code point that is in the space of allocatable values, which excludes a few things between 0 and 10FFFF. It's testable here by `is_unicode_char`.

Then, define that term and use it for the specification of "is_unicode_char".

> If the results are "unspecified", we can't put contracts on those
> functions to check their preconditions. That feels like the wrong
> approach.
>
> What's the current state of contracts on noexcept functions?

The program terminates if the precondition is violated and the
user chooses not to ignore contracts.
Better than continuing with garbage values for some people.

> I'd hate to ruin someone's tight loop that is invoking ((((supplementary) >> 10) + 0xd7c0)) that ought not to throw, it just encourages avoiding the standard library.

It's a separate, future decision whether to actually attach a contract precondition
to any particular function, so I'm not seeing a problem here.

> "Does not branch."
>
> You can't make such a statement. You have no idea what the
> implementation will do.
>
> I suppose I can replace that with some ridiculous performance constraint, but the reason for its existence is that you can implement it as three tests and addition of the results that pipelines well and is cache friendly.

That's fine, but that's QoI and you can't say that normatively. Thus, don't try.

> There should be a function that checks "codepoint <= 0x10ffff".
>
>
>
> `is_unicode_char` turns out to be what you want in practice.

Then you should use it in the implementation, I guess.

> is_supplementary
>
> The static_cast should be around "codepoint", I guess.
>
>
> I think we want to lift to unsigned comparison after the base shift. I'll spend time checking this though.

It's just implementation detail, so it doesn't really matter at all.
Unless you want to show the implementation as the specification
"effects: equivalent to".

> is_lead/trail:
>
> The parens around "codeunit" can go. Also, add spaces.
> And 0xfffffc00 makes no sense for a char16_t.
> Also for surrogate_trail.
>
> It's really looking like these are pessimizations on any 32 bit architecture, so everything in the last 20 years or so of real interest. Probably best to let the char16 ones be subsumed by char32, and I'm making a note to delete some old code at work.

We still need char16_t overloads of the functions, regardless of how they're implemented.

> surrogate_offset seems to be very much an implementation
> detail of get_supplementary. Drop the former.
>
>
> I think I probably should have a struct, or free constants, with the relevant magic numbers for people re-implementing or hand fusing the algorithms. There seems to be a lot in code I can scan at work. No idea if they really should be, but they do.

I'm all for adding constants such as "first_low_surrogate" or "first_high_surrogate"
or "maximum_codepoint_value" = 0x10ffff.
(and those are currently missing, and there are a lot of literals in the comparisons,
which is bad style).

This particular constant, however, is an artifact of a particular implementation
for some of those functions. Unless you can show a use-case that is not covered
by the functions you present and that needs this particular constant, this constant
needs to go. And if you do have a use-case, maybe you're missing another of those
basic functions. (I also note that this algorithm-specific constant can be
computed from first_low_surrogate and first_high_surrogate.)

> count_utf8_trail_bytes_unsafe
>
> This looks like it allows for arbitrary "int" arguments in
> its implementation, and that's the rationale for _unsafe.
> Is there any 8-bit value >= 0xc2 that is NOT a lead byte?
> If not, remove the _unsafe interface.
>
> Yes, there are a lot, especially once the UC consortium restricted the encoding to at most 4 bytes, rather than 6.

Oh well.

> utf8_length
>
> Replace the ladder of conditional expressions with regular
> if ... else
>
> All compilers today do the right thing, so I agree.

Even if compilers wouldn't agree, this is an implementation detail.

Jens

Received on 2023-09-04 07:10:19