> is_unicode_nonchar
Better to stick with "character"; "char" as an abbreviation only adds confusion.
Tentatively agree
> is_unicode_char
While noncharacters are specified, characters are not. This doesn't do much and should be removed,
unless we want is_scalar_value instead (which is ever so slightly different from is_unicode_nonchar).
This is also spelled isLegal in Java and in other APIs. I don't like that term. I'd welcome a better one.
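To make the distinction concrete, here is a rough sketch of the three predicates. The bodies are my assumption, based on the Unicode definitions of noncharacters and scalar values, not the proposed wording. In particular, `is_unicode_char` is assumed to mean "a scalar value that is not a noncharacter".

```cpp
// Sketch only: names follow this thread, bodies are assumed.

// Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
constexpr bool is_unicode_nonchar(char32_t c) {
    return (c >= 0xFDD0 && c <= 0xFDEF) ||
           (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE);
}

// Scalar values: any code point except the surrogates U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t c) {
    return c < 0xD800 || (c >= 0xE000 && c <= 0x10FFFF);
}

// Assumed meaning of is_unicode_char: a scalar value that is not a noncharacter.
constexpr bool is_unicode_char(char32_t c) {
    return is_scalar_value(c) && !is_unicode_nonchar(c);
}
```

Under these assumptions the last two predicates differ only on the 66 noncharacters: `is_scalar_value(0xFFFE)` is true while `is_unicode_char(0xFFFE)` is false.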
> is_lead/is_trail
Unicode usually goes with high/low surrogate; we might as well stick with the standard names.
ICU uses lead/trail, but I'm fine with high/low, although I always have to check whether high should come before low in the stream.
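For reference, a minimal sketch of the two predicates under the high/low naming. The function names and bodies are assumptions for illustration; the ranges are the ones Unicode defines for UTF-16 surrogates.

```cpp
// Sketch only: hypothetical names using the Unicode high/low terminology.

// High (lead) surrogates are U+D800..U+DBFF and come FIRST in a UTF-16 pair.
constexpr bool is_high_surrogate(char16_t u) {
    return (u & 0xFC00) == 0xD800;
}

// Low (trail) surrogates are U+DC00..U+DFFF and come SECOND in a UTF-16 pair.
constexpr bool is_low_surrogate(char16_t u) {
    return (u & 0xFC00) == 0xDC00;
}
```

So in a well-formed stream the order is always high, then low; for example U+1F600 is encoded as 0xD83D (high) followed by 0xDE00 (low).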
> is_single
Seems redundant, and the name is confusing.
> is_lead(char16_t codeunit)
What is the motivation for having both char32_t and char16_t overloads?
Ancient architectures, apparently.
> surrogate_offset
This is too much of an implementation detail for my taste.
People end up using them, for good or ill. But it should probably be a struct of specialized constants.
> get_supplementary
We probably want preconditions on that, instead of making it unspecified. Do we need a better name?
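One possible shape for both points, as a sketch: the constants grouped in a struct, and `get_supplementary` with an explicit precondition instead of unspecified behavior. The struct name, member names, and the use of `assert` for the precondition are all assumptions made for illustration (C++14 or later).

```cpp
#include <cassert>

// Hypothetical grouping of the surrogate constants discussed above.
struct surrogate_constants {
    static constexpr char32_t high_first = 0xD800;          // first high surrogate
    static constexpr char32_t low_first = 0xDC00;           // first low surrogate
    static constexpr char32_t supplementary_first = 0x10000; // first non-BMP code point
    // Combined offset so that decoding is (high << 10) + low - offset.
    static constexpr char32_t offset =
        (high_first << 10) + low_first - supplementary_first;
};

// Precondition: high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
constexpr char32_t get_supplementary(char16_t high, char16_t low) {
    assert((high & 0xFC00) == 0xD800 && "high must be a high surrogate");
    assert((low & 0xFC00) == 0xDC00 && "low must be a low surrogate");
    return (static_cast<char32_t>(high) << 10) + low - surrogate_constants::offset;
}
```

With this, `get_supplementary(0xD83D, 0xDE00)` yields U+1F600, and a caller passing anything other than a valid pair trips the assertion in debug builds rather than getting an unspecified value.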
> is_utf8_lead
> is_utf8_trail
What are the use cases for that?
Parsing UTF-8 streams.
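As an illustration of the stream-parsing use case, here is a small sketch. The predicate bodies are assumptions (lead bytes taken as 0xC2..0xF4, trail bytes as 0x80..0xBF), `count_code_points` is a hypothetical helper, and `unsigned char` stands in for `char8_t` so that it builds before C++20.

```cpp
#include <cstddef>

// Trail (continuation) bytes have the bit pattern 10xxxxxx.
constexpr bool is_utf8_trail(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Lead bytes of multi-byte sequences; 0xC0, 0xC1 and 0xF5..0xFF never appear
// in well-formed UTF-8, so they are excluded here (an assumption).
constexpr bool is_utf8_lead(unsigned char b) {
    return b >= 0xC2 && b <= 0xF4;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a trail byte.
constexpr std::size_t count_code_points(const char* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!is_utf8_trail(static_cast<unsigned char>(s[i]))) {
            ++count;
        }
    }
    return count;
}
```

The same `is_utf8_trail` test is what lets a parser resynchronize after landing in the middle of a sequence: skip bytes while it returns true.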
> count_utf8_trail_bytes
> count_utf8_trail_bytes_unsafe
LLVM has a single getNumBytesForUTF8(char8_t) function that returns the total number of bytes, including the leading one.
In terms of having safe and unsafe variants, it's not really needed, but we should have a precondition.
The implementation can compute a result without comparisons (LLVM uses a precomputed table, for example).
Looks like in LLVM that's not the primitive, either. Which makes sense, you want to know how many _more_ bytes you need to read.
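A sketch of the table-driven approach, to show that a single function with a precondition is enough. This is not LLVM's code: the 32-entry table indexed by the top five bits, the names, and the use of `unsigned char` in place of `char8_t` are assumptions for illustration.

```cpp
// Trailing-byte count indexed by the top five bits of the first byte.
// Entries for bytes that cannot start a sequence (0x80..0xBF, 0xF8..0xFF)
// are arbitrary zeros; passing such a byte violates the precondition.
constexpr unsigned char utf8_trail_count_table[32] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x00..0x7F: ASCII
    0, 0, 0, 0, 0, 0, 0, 0,                         // 0x80..0xBF: trail bytes
    1, 1, 1, 1,                                     // 0xC0..0xDF: 2-byte lead
    2, 2,                                           // 0xE0..0xEF: 3-byte lead
    3,                                              // 0xF0..0xF7: 4-byte lead
    0                                               // 0xF8..0xFF: invalid
};

// Precondition: first is the first byte of a well-formed UTF-8 sequence.
// Returns how many MORE bytes need to be read; no comparisons involved.
constexpr int count_utf8_trail_bytes(unsigned char first) {
    return utf8_trail_count_table[first >> 3];
}

// Total sequence length, including the first byte.
constexpr int utf8_sequence_length(unsigned char first) {
    return count_utf8_trail_bytes(first) + 1;
}
```

The total-length form falls out of the trailing-count primitive by adding one, so only one of the two needs to be specified.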