sg16

Re: Core formal unicode classifications functions

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sun, 3 Sep 2023 08:18:50 +0200
On 03/09/2023 05.06, Steve Downey via SG16 wrote:
> Paper to follow:
> https://github.com/steve-downey/unicode-formal/blob/main/src/unicode_formal/unicode_formal.h
>
> Based on the ICU C macros, but once types are added, some of them are not very useful, or plain harmful.
>
> These are all classifications and counts, no decode or encode, or validity checks above code unit.
>
> I haven't added the various named block functions, as I don't really see a particular point to them outside of short-circuiting real Unicode database functions.
>
> They're all constexpr inline noexcept. Some may have 'unspecified' results where that means the value is deterministic but meaningless. There is no undefined behavior.

Drop the "inline"; it is an implementation detail.

Where do the names come from? Does Unicode define them, somewhere?
If not, and those are your own inventions:
  is_lead, is_trail -> add _surrogate
or maybe spell it out as "is_leading_surrogate".


"Code point offset for surrogate pair calculation"

I don't know what that means. Apparently, this function
returns a constant. (It has no parameters.) Maybe a
constexpr variable would be better? Same for utf16_max_length.
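
A minimal sketch of the constexpr-variable alternative. The namespace name is hypothetical, and the value of surrogate_offset is derived here from the UTF-16 encoding form rather than copied from the header under review, so treat it as an assumption about what the function returns.

```cpp
namespace unicode_sketch {  // hypothetical namespace, not the proposal's

// Assumed meaning: the constant that turns
//   (lead << 10) + trail
// into a supplementary code point when subtracted.
inline constexpr char32_t surrogate_offset =
    (0xD800u << 10) + 0xDC00u - 0x10000u;

// Maximum number of UTF-16 code units needed for one code point.
inline constexpr int utf16_max_length = 2;

// Example use: combine a surrogate pair into a code point.
// No validation is done here; both arguments are assumed to be
// a well-formed lead/trail pair.
constexpr char32_t combine_pair(char16_t lead, char16_t trail) noexcept {
    return (static_cast<char32_t>(lead) << 10) +
           static_cast<char32_t>(trail) - surrogate_offset;
}

static_assert(combine_pair(0xD83D, 0xDE00) == 0x1F600);

}  // namespace unicode_sketch
```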

Why "_formal" in the namespace name? Drop it.

surrogate_lead/trail:
"The result is unspecified if the code point is not assignable."

What does "assignable" mean? Can I somehow test for that before
calling the function?
Shouldn't the function indicate an error when that happens?
Would it be more useful to have a

  std::tuple<lead, trail, error> split_supplementary()

function, usable with structured bindings?

If the results are "unspecified", we can't put contracts on those
functions to check their preconditions. That feels like the wrong
approach.
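
A rough sketch of what such a function could look like. The names split_error and round_trips, the namespace, and the choice to return zeroed code units on error are all assumptions made for illustration; none of them come from the header under review.

```cpp
#include <tuple>

namespace unicode_sketch {  // hypothetical namespace

enum class split_error { none, not_supplementary };  // hypothetical type

// Splits a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 lead and trail surrogates. Any other input is reported
// through the third tuple element instead of yielding an
// unspecified value.
constexpr std::tuple<char16_t, char16_t, split_error>
split_supplementary(char32_t codepoint) noexcept {
    if (codepoint < 0x10000 || codepoint > 0x10FFFF) {
        return {u'\0', u'\0', split_error::not_supplementary};
    }
    const char32_t v = codepoint - 0x10000;  // 20 significant bits
    return {static_cast<char16_t>(0xD800 + (v >> 10)),
            static_cast<char16_t>(0xDC00 + (v & 0x3FF)),
            split_error::none};
}

// Usage with structured bindings: split, then recombine and compare.
inline bool round_trips(char32_t codepoint) {
    const auto [lead, trail, error] = split_supplementary(codepoint);
    if (error != split_error::none) {
        return false;
    }
    const auto back = static_cast<char32_t>(
        0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00));
    return back == codepoint;
}

}  // namespace unicode_sketch
```

With a defined error result like this, the precondition (the argument is a supplementary code point) is also something a contract could state and check.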


"leadByte"

The standard doesn't do camelCase.


"Does not branch."

You can't make such a statement. You have no idea what the
implementation will do.

There should be a function that checks "codepoint <= 0x10ffff".
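
Something along these lines would do; the name is_code_point and the namespace are placeholders, not names taken from the header.

```cpp
namespace unicode_sketch {  // hypothetical namespace

// True if the value lies in the Unicode code space U+0000..U+10FFFF.
// Surrogate code points are code points too, so they are accepted.
constexpr bool is_code_point(char32_t codepoint) noexcept {
    return codepoint <= 0x10FFFF;
}

static_assert(is_code_point(U'\0'));
static_assert(is_code_point(0x10FFFF));
static_assert(!is_code_point(0x110000));

}  // namespace unicode_sketch
```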


is_supplementary

The static_cast should be around "codepoint", I guess.

is_lead/trail:

The parens around "codeunit" can go. Also, add spaces.
And the mask 0xfffffc00 makes no sense for a char16_t.
The same applies to surrogate_trail.
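
For illustration, a sketch of the two predicates with a 16-bit mask, no extra parentheses, and the longer names suggested above. The signatures are assumed; the header under review may differ.

```cpp
namespace unicode_sketch {  // hypothetical namespace

// Lead (high) surrogates are 0xD800..0xDBFF: top six bits 1101'10.
constexpr bool is_leading_surrogate(char16_t codeunit) noexcept {
    return (codeunit & 0xFC00) == 0xD800;
}

// Trail (low) surrogates are 0xDC00..0xDFFF: top six bits 1101'11.
constexpr bool is_trailing_surrogate(char16_t codeunit) noexcept {
    return (codeunit & 0xFC00) == 0xDC00;
}

static_assert(is_leading_surrogate(0xD83D));
static_assert(is_trailing_surrogate(0xDE00));
static_assert(!is_leading_surrogate(u'A'));

}  // namespace unicode_sketch
```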

surrogate_offset seems to be very much an implementation
detail of get_supplementary. Drop the former.


count_utf8_trail_bytes_unsafe

This looks like it allows for arbitrary "int" arguments in
its implementation, and that's the rationale for _unsafe.
Is there any 8-bit value >= 0xc2 that is NOT a lead byte?
If not, remove the _unsafe interface.
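
For reference: in well-formed UTF-8 as defined by the Unicode Standard, lead bytes of multi-byte sequences are 0xC2..0xF4, so 0xF5..0xFF are 8-bit values >= 0xc2 that are not lead bytes. A checked variant could look like the sketch below. The function names and the use of unsigned char (rather than char8_t) are assumptions, as is returning 0 for a byte that is not a lead byte.

```cpp
namespace unicode_sketch {  // hypothetical namespace

// Lead byte of a well-formed multi-byte UTF-8 sequence.
constexpr bool is_utf8_lead_byte(unsigned char byte) noexcept {
    return byte >= 0xC2 && byte <= 0xF4;
}

// Number of trail bytes announced by a lead byte:
//   0xC2..0xDF -> 1, 0xE0..0xEF -> 2, 0xF0..0xF4 -> 3.
// Returns 0 for anything that is not a lead byte (ASCII, trail
// bytes, 0xC0, 0xC1, 0xF5..0xFF).
constexpr int count_utf8_trail_bytes(unsigned char byte) noexcept {
    if (!is_utf8_lead_byte(byte)) {
        return 0;
    }
    return 1 + (byte >= 0xE0) + (byte >= 0xF0);
}

static_assert(count_utf8_trail_bytes(0xE2) == 2);
static_assert(count_utf8_trail_bytes(0xF5) == 0);

}  // namespace unicode_sketch
```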

utf8_length

Replace the ladder of conditional expressions with regular
if ... else statements.
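
Roughly like this. The sketch assumes the function returns 0 for surrogate code points and for values above 0x10ffff, as the ICU macro it is modeled on does; the header under review may specify something else.

```cpp
namespace unicode_sketch {  // hypothetical namespace

// Number of UTF-8 code units needed to encode a code point.
// Assumption: 0 means "cannot be encoded" (surrogate or out of range).
constexpr int utf8_length(char32_t codepoint) noexcept {
    if (codepoint <= 0x7F) {
        return 1;
    } else if (codepoint <= 0x7FF) {
        return 2;
    } else if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
        return 0;  // surrogates have no UTF-8 encoding
    } else if (codepoint <= 0xFFFF) {
        return 3;
    } else if (codepoint <= 0x10FFFF) {
        return 4;
    } else {
        return 0;  // beyond the Unicode code space
    }
}

static_assert(utf8_length(U'A') == 1);
static_assert(utf8_length(0x20AC) == 3);  // EURO SIGN

}  // namespace unicode_sketch
```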


Jens

Received on 2023-09-03 06:18:55