ISOCPP sg16 List: Re: Formal properties library for Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 29 Aug 2023 07:36:23 -0400

On Tue, Aug 29, 2023, 06:33 Jens Maurer <jens.maurer_at_[hidden]> wrote:

>
>
> On 29/08/2023 02.44, Steve Downey via SG16 wrote:
> > On my todo list for a couple months. I had some time to think about it,
> and want some feedback on what I'm thinking before getting too deep. What I
> believe we need are the collection of classification functions that only
> depend on the form of codepoints, or for UTF-16 and -8 code units. In ICU
> these are often macros, which for C++ should be inline functions. I believe
> these should be wide contract, and noexcept, which implies that, for
> example, an `is_high_surrogate` would return false for a char32_t above the
> code point range. ICU also has `safe` vs `` versions of macros, which I
> believe should be reversed today (
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf8_8h.html )
> >
> > I believe the functions should be in terms of char{8,16,32}_t, and that
> we leave byte in particular out until we deal with IO. wchar_t is at this
> point unportable, so I think it's not a good candidate either.
> >
> > Areas for functions.
> > Codepoint classification
> > scalar value, code point value, validity, encoding length, high/low
> surrogate, BOM classification
>
> These seem easy on the interface, because they're dealing with a single
> code point
> (represented as a single integer value), right?
>
> > char16_t
> > similar - nothing other than BOM miss for BE/LE UTF-16
> > char8_t
> > lead byte, trail byte, counts, etc (see ICU)
>
> Is any of those functions getting a char* or char8_t* or similar parameter?
> If not, what's the difference for the _safe vs. `` variation?
>

The safe versions will, for example, check that you have a valid trailing
byte before masking it and giving you the value it contributes to the
code point, and give you 0 rather than nonsense. The unmarked version just
drops the top two bits.

If the data is known to be well formed, you avoid a branch.
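
A minimal sketch of the checked/unchecked split for a single UTF-8 trailing
byte. It assumes nothing about ICU's actual macros; the names
`trail_bits_unchecked`, `trail_bits_checked`, `is_trail_byte` and the
`u8unit` alias are hypothetical, and the alias only exists so the snippet
also builds before C++20.

```cpp
#include <cstdint>

// Hypothetical alias: char8_t where the compiler has it, unsigned char otherwise.
#if defined(__cpp_char8_t)
using u8unit = char8_t;
#else
using u8unit = unsigned char;
#endif

// A UTF-8 trailing byte has the bit pattern 10xxxxxx.
constexpr bool is_trail_byte(u8unit b) noexcept {
    return (static_cast<std::uint32_t>(b) & 0xC0u) == 0x80u;
}

// Unchecked: no branch, just drops the top two bits.
// The result is only meaningful if b really is a trailing byte.
constexpr std::uint32_t trail_bits_unchecked(u8unit b) noexcept {
    return static_cast<std::uint32_t>(b) & 0x3Fu;
}

// Checked: pays for one branch and yields 0 for anything that is not
// a trailing byte.
constexpr std::uint32_t trail_bits_checked(u8unit b) noexcept {
    return is_trail_byte(b) ? trail_bits_unchecked(b) : 0u;
}
```

For 0xA9 (the second byte of U+00E9) both versions return 0x29. For 0x41,
which is not a trailing byte, the checked version returns 0 and the
unchecked one returns 0x01.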

The behavior of the unsafe, unmarked versions is specified in ICU, but it
is garbage in, garbage out.
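
For the single-code-point classification functions quoted above, a rough
sketch of what wide-contract, noexcept versions on char32_t could look
like. Apart from `is_high_surrogate`, the function names are hypothetical
and not a proposed interface.

```cpp
// Wide contract: every char32_t value is an acceptable argument,
// including values above the code point range.

constexpr bool is_code_point(char32_t c) noexcept {
    return c <= 0x10FFFF;
}

constexpr bool is_high_surrogate(char32_t c) noexcept {
    return c >= 0xD800 && c <= 0xDBFF;
}

constexpr bool is_low_surrogate(char32_t c) noexcept {
    return c >= 0xDC00 && c <= 0xDFFF;
}

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return is_code_point(c) && !is_high_surrogate(c) && !is_low_surrogate(c);
}

// Out-of-range input is simply "not a high surrogate"; nothing is undefined.
static_assert(!is_high_surrogate(0x110000), "wide contract");
```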

> Thanks,
> Jens
>
>

Received on 2023-08-29 11:36:35