sg16

Formal properties library for Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Mon, 28 Aug 2023 20:44:06 -0400
This has been on my todo list for a couple of months. I had some time to
think about it, and want some feedback on what I'm thinking before getting
too deep. What I believe we need is a collection of classification
functions that depend only on the form of code points or, for UTF-16 and
UTF-8, code units. In ICU these are often macros, which for C++ should be
inline functions. I believe these should be wide contract and noexcept,
which implies that, for
example, an `is_high_surrogate` would return false for a char32_t above the
code point range. ICU also has `safe` vs `` versions of macros, which I
believe should be reversed today (
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf8_8h.html )
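
To make the wide-contract idea concrete, here is a minimal sketch of what the
char32_t classifiers could look like. The namespace `uprop` and every function
name except `is_high_surrogate` are hypothetical, chosen only for illustration;
returning 0 from the length functions for non-scalar input is likewise an
assumption, not something settled above. Each function accepts every possible
char32_t value and never throws.

```cpp
// Sketch only: names and namespace are hypothetical, not a proposed API.
namespace uprop {

// Code points are the integers 0 through 0x10FFFF.
constexpr bool is_code_point(char32_t c) noexcept { return c <= 0x10FFFF; }

// Surrogate ranges. Values above 0x10FFFF simply fall outside them,
// so the wide contract needs no extra check here.
constexpr bool is_high_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDBFF;
}
constexpr bool is_low_surrogate(char32_t c) noexcept {
    return c >= 0xDC00 && c <= 0xDFFF;
}
constexpr bool is_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !is_surrogate(c);
}

// Encoding lengths in code units. Assumption: 0 signals "not encodable".
constexpr int utf8_length(char32_t c) noexcept {
    return !is_scalar_value(c) ? 0
         : c < 0x80            ? 1
         : c < 0x800           ? 2
         : c < 0x10000         ? 3
                               : 4;
}
constexpr int utf16_length(char32_t c) noexcept {
    return !is_scalar_value(c) ? 0 : c < 0x10000 ? 1 : 2;
}

}  // namespace uprop

// Wide contract: out-of-range input yields false rather than undefined behavior.
static_assert(!uprop::is_high_surrogate(char32_t{0x110000}), "above range");
static_assert(uprop::is_high_surrogate(char32_t{0xD800}), "first high surrogate");
```

Because everything is constexpr, the same functions work in constant
expressions and cost nothing more than the ICU-style macros at run time.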

I believe the functions should be in terms of char{8,16,32}_t, and that we
leave byte in particular out until we deal with IO. wchar_t is at this
point unportable, so I think it's not a good candidate either.

Areas for functions:
Codepoint classification
    scalar value, code point value, validity, encoding length, high/low
    surrogate, BOM classification
char16_t
    similar - nothing other than BOM miss for BE/LE UTF-16
char8_t
    lead byte, trail byte, counts, etc. (see ICU)
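
As a rough illustration of the code-unit areas listed above, the following
sketch shows what char16_t and char8_t classifiers might look like. All names
are hypothetical. Treating 0xFFFE as a "byte-swapped BOM" is one reading of the
BE/LE remark, not something the list above spells out. The `utf8_unit` alias
exists only so the sketch also compiles before C++20; the intent is plain
char8_t.

```cpp
// Sketch only: names and namespace are hypothetical, not a proposed API.
namespace uprop {

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = unsigned char;  // pre-C++20 stand-in for this sketch only
#endif

// UTF-16 code units: the top six bits identify a surrogate half.
constexpr bool is_high_surrogate(char16_t u) noexcept {
    return (u & 0xFC00) == 0xD800;
}
constexpr bool is_low_surrogate(char16_t u) noexcept {
    return (u & 0xFC00) == 0xDC00;
}

// U+FEFF read in the expected byte order, or in the opposite one.
constexpr bool is_bom(char16_t u) noexcept { return u == 0xFEFF; }
constexpr bool is_swapped_bom(char16_t u) noexcept { return u == 0xFFFE; }

// UTF-8 code units. 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed
// UTF-8, so they are classified as neither lead nor trail bytes.
constexpr bool is_single_byte(utf8_unit b) noexcept { return b < 0x80; }
constexpr bool is_trail_byte(utf8_unit b) noexcept {
    return (b & 0xC0) == 0x80;
}
constexpr bool is_lead_byte(utf8_unit b) noexcept {
    return b >= 0xC2 && b <= 0xF4;
}

// Number of trail bytes a lead byte announces; 0 for anything else.
constexpr int trail_count(utf8_unit b) noexcept {
    return b < 0xC2 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : b <= 0xF4 ? 3 : 0;
}

}  // namespace uprop

// U+20AC is encoded in UTF-8 as E2 82 AC: one lead byte, two trail bytes.
static_assert(uprop::trail_count(uprop::utf8_unit{0xE2}) == 2, "3-byte lead");
static_assert(uprop::is_trail_byte(uprop::utf8_unit{0x82}), "trail byte");
```

As with the code point functions, every input value has a defined answer, so
none of these needs a precondition.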

Received on 2023-08-29 00:44:21