Date: Tue, 29 Aug 2023 12:33:41 +0200
On 29/08/2023 02.44, Steve Downey via SG16 wrote:
> On my todo list for a couple months. I had some time to think about it, and want some feedback on what I'm thinking before getting too deep. What I believe we need are the collection of classification functions that only depend on the form of codepoints, or for UTF-16 and -8 code units. In ICU these are often macros, which for C++ should be inline functions. I believe these should be wide contract, and noexcept, which implies that, for example, an `is_high_surrogate` would return false for a char32_t above the code point range. ICU also has `safe` vs `` versions of macros, which I believe should be reversed today ( https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf8_8h.html )
>
> I believe the functions should be in terms of char{8,16,32}_t, and that we leave byte in particular out until we deal with IO. wchar_t is at this point unportable, so I think it's not a good candidate either.
>
> Areas for functions.
> Codepoint classification
> scalar value, code point value, validity, encoding length, high/low surrogate, BOM classification
These seem easy on the interface, because they're dealing with a single code point
(represented as a single integer value), right?
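
To make the single-code-point case concrete, here is a minimal sketch of what such wide-contract classifiers might look like. All function names are hypothetical and chosen for illustration only; nothing here is proposed wording or an existing library interface. Every function accepts any char32_t value and simply answers false (or 0) for inputs outside the code point range.

```cpp
// Hypothetical names; a sketch of wide-contract, noexcept classifiers
// over a single char32_t value. Not taken from ICU or any proposal.

// True for any value in the Unicode code point range U+0000..U+10FFFF.
constexpr bool is_code_point(char32_t c) noexcept { return c <= 0x10FFFF; }

// Surrogate tests are total: values above the code point range yield false.
constexpr bool is_high_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDBFF;
}
constexpr bool is_low_surrogate(char32_t c) noexcept {
    return c >= 0xDC00 && c <= 0xDFFF;
}
constexpr bool is_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is a code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !is_surrogate(c);
}

// Encoding length in code units; 0 signals "not encodable" (an assumption
// of this sketch, other conventions are possible).
constexpr int utf8_length(char32_t c) noexcept {
    return !is_scalar_value(c) ? 0
         : c < 0x80            ? 1
         : c < 0x800           ? 2
         : c < 0x10000         ? 3
                               : 4;
}
constexpr int utf16_length(char32_t c) noexcept {
    return !is_scalar_value(c) ? 0 : c < 0x10000 ? 1 : 2;
}
```

Because each function takes one integer by value, there is no precondition to violate, which is what makes the wide contract plus noexcept combination straightforward here.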
> char16_t
> similar - nothing other than BOM miss for BE/LE UTF-16
> char8_t
> lead byte, trail byte, counts, etc (see ICU)
Do any of those functions take a char*, char8_t*, or similar parameter?
If not, what's the difference between the _safe and `` variations?
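
To illustrate the distinction behind the question, here is a rough sketch under stated assumptions. The names (`u8unit`, `is_trail`, `decode2_unchecked`, `decode2_checked`) are hypothetical, and the `u8unit` alias exists only so the sketch also compiles before C++20. Classifiers over a single code unit are total, so a checked/unchecked split has nothing to distinguish; the split only appears once a function reads a sequence through a pointer.

```cpp
// Hypothetical sketch; names are illustrative, not ICU's or a proposal's.
#if defined(__cpp_char8_t)
using u8unit = char8_t;
#else
using u8unit = unsigned char;  // assumption: stand-in where char8_t is absent
#endif

// Single-code-unit classifiers: defined for every input value.
constexpr bool is_single(u8unit b) noexcept { return b < 0x80; }
constexpr bool is_trail(u8unit b) noexcept { return (b & 0xC0) == 0x80; }
// 0xC0, 0xC1 and 0xF5..0xFF never start a well-formed sequence.
constexpr bool is_lead(u8unit b) noexcept { return b >= 0xC2 && b <= 0xF4; }
constexpr int count_trail_bytes(u8unit lead) noexcept {
    return !is_lead(lead) ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
}

// Narrow contract: the caller guarantees p points at a well-formed
// two-unit sequence. No checks are performed.
constexpr char32_t decode2_unchecked(const u8unit* p) {
    return (char32_t(p[0] & 0x1F) << 6) | char32_t(p[1] & 0x3F);
}

// Wide contract: validates length and form, and returns U+FFFD for
// anything that is not a well-formed two-unit sequence.
constexpr char32_t decode2_checked(const u8unit* p, const u8unit* end) noexcept {
    return (end - p >= 2 && p[0] >= 0xC2 && p[0] <= 0xDF && is_trail(p[1]))
               ? decode2_unchecked(p)
               : char32_t(0xFFFD);
}
```

In this sketch only the two pointer-taking decoders differ in contract; the by-value classifiers have a single sensible form.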
Thanks,
Jens
Received on 2023-08-29 10:33:47