On Fri, Oct 7, 2022 at 12:14 AM Thiago Macieira via SG16 <sg16@lists.isocpp.org> wrote:

On Thursday, 6 October 2022 14:13:07 PDT Dimitrij Mijoski via SG16 wrote:
> The functions will fall in four categories:
>
> 1. *Functions that decode/encode only one code point.* ICU has this
> functionality in utf8.h
>
> <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html>
> and utf16.h
>
> <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html>,
> but it is implemented as macros that are not type-safe.

I don't see how this can safely be done because UTF-8 and UTF-16 might need
multiple code units to represent a single code point. Therefore, conversions
are inherently a string operation.

Maybe you mixed things up. The functions do work with strings of code units. The decoding functions for UTF-8 (in group #1) accept a string of code units, read 1-4 units and return exactly one code point. That is a perfectly well-defined operation. The encoding functions do the opposite: they take one code point and write 1-4 code units.
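To make that concrete, here is a rough sketch of what a group #1 decoding function could look like. The name decode_one_utf8, its signature, and the choice to return U+FFFD and skip one code unit on malformed input are only my assumptions for illustration, not the interface I am proposing.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: decodes the code point that starts at s[i] and
// advances i past it. On malformed input it returns U+FFFD and advances i
// by one code unit (an assumed error policy).
// Precondition: i < s.size().
inline char32_t decode_one_utf8(std::string_view s, std::size_t& i)
{
    assert(i < s.size());
    const auto unit = [&s](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const char32_t replacement = 0xFFFD;
    const unsigned char lead = unit(i);

    std::size_t len = 0;   // number of code units in the sequence
    char32_t cp = 0;       // code point being assembled
    char32_t min_cp = 0;   // smallest value allowed for this length (rejects overlong forms)
    if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
    else { ++i; return replacement; }   // stray continuation or invalid lead unit

    if (s.size() - i < len) { ++i; return replacement; }   // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = unit(i + k);
        if ((c & 0xC0) != 0x80) { ++i; return replacement; }   // not a continuation unit
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, values above U+10FFFF, and surrogates.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++i;
        return replacement;
    }
    i += len;
    return cp;
}
```

A caller loops with "while (i < s.size())" and calls the function once per iteration, so the input is a string of code units and the result of each call is a single code point.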

Unless there's a strong use-case for these, I'd simply forego them completely.
But since this is what you've dedicated the majority of your email to, you may
feel there is a need for them. Can you elaborate?

Iterative conversion could probably be implemented with a stateful chunk API
(your #4) where the input chunks are of size 1, though output chunks should
probably never be less than 4 bytes (2 char16_t or 1 char32_t) long.

The functions in the four groups have different performance characteristics, and in any given scenario one group performs better than the others. The first group, which is my focus now, is the lowest level and has zero overhead. If these functions are used as intended, as I have shown in the examples, they do no unnecessary bounds checking, i.e. a string with N code units is iterated with N+1 checks. So the functions in the higher groups should be implemented with those from the lower groups. Group #4, the API for decoding in chunks, keeps state between calls, so it pays for saving the execution state to a variable before returning and for restoring it on the next call. You can implement #1 with #4, but it is suboptimal.
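As an illustration of building a higher group on top of group #1, here is a minimal sketch of a whole-string conversion written as a loop over a one-code-point encoder. Both names (encode_one_utf8, utf32_to_utf8) and the policy of writing U+FFFD for invalid input are assumptions made for this example only.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical group #1 style primitive: appends the UTF-8 encoding of cp
// to out. Values that are not Unicode scalar values are written as U+FFFD
// (an assumed error policy).
inline void encode_one_utf8(char32_t cp, std::string& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical whole-string conversion, implemented purely in terms of the
// primitive above. No state has to be saved or restored between iterations.
inline std::string utf32_to_utf8(std::u32string_view in)
{
    std::string out;
    out.reserve(in.size());   // lower bound: at least one code unit per code point
    for (char32_t cp : in)
        encode_one_utf8(cp, out);
    return out;
}
```

Going the other way, implementing the one-code-point function on top of a stateful chunk API, would have to thread a state object through every call, which is the overhead I mean.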

Let's put performance aside and turn to usability. I will show you two scenarios where these functions come in very handy.

Problem 1: Given a UTF-8 encoded string in std::string, replace every second code point with a randomly generated code point, in place.

Problem 2: A more realistic example: given a UTF-8 encoded string, replace or erase each code point that satisfies a certain Unicode character property.

These functions are very useful if we want to directly manipulate UTF-8 encoded strings, without using a temporary UTF-32 string.
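A rough sketch of Problem 2 (the erase variant), working directly on the UTF-8 code units. Everything here is hypothetical: the helper names are made up, the input is assumed to be well-formed UTF-8 so the decoder does no validation, and the predicate stands in for a real Unicode property lookup, which the standard library does not provide.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper. Assumes s is well-formed UTF-8 (no validation is done).
// Returns the code point starting at s[i] and stores its length in code
// units in len.
inline char32_t decode_valid_utf8(const std::string& s, std::size_t i, std::size_t& len)
{
    const auto unit = [&s](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char lead = unit(i);
    if (lead < 0x80) { len = 1; return lead; }
    if (lead < 0xE0) {
        len = 2;
        return static_cast<char32_t>(((lead & 0x1F) << 6) | (unit(i + 1) & 0x3F));
    }
    if (lead < 0xF0) {
        len = 3;
        return static_cast<char32_t>(((lead & 0x0F) << 12) | ((unit(i + 1) & 0x3F) << 6)
                                     | (unit(i + 2) & 0x3F));
    }
    len = 4;
    return static_cast<char32_t>(((lead & 0x07) << 18) | ((unit(i + 1) & 0x3F) << 12)
                                 | ((unit(i + 2) & 0x3F) << 6) | (unit(i + 3) & 0x3F));
}

// Erases, in place, every code point for which pred(cp) is true.
// Kept code units are compacted toward the front; write never passes read,
// so no temporary UTF-32 (or UTF-8) buffer is needed.
template <class Pred>
void erase_code_points_if(std::string& s, Pred pred)
{
    std::size_t read = 0;
    std::size_t write = 0;
    while (read < s.size()) {
        std::size_t len = 0;
        const char32_t cp = decode_valid_utf8(s, read, len);
        if (!pred(cp)) {
            for (std::size_t k = 0; k < len; ++k)
                s[write + k] = s[read + k];
            write += len;
        }
        read += len;
    }
    s.resize(write);
}
```

For example, calling erase_code_points_if(s, [](char32_t cp) { return cp > 0x7F; }) strips everything outside ASCII. The replace variant is similar, but the replacement may have a different encoded length, so it needs an encoder for one code point as well.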