ISOCPP sg16 List: Re: Proposal for low-level Unicode decoding and encoding

From: Dimitrij Mijoski <dim.mj.p_at_[hidden]>
Date: Fri, 7 Oct 2022 22:01:25 +0200

On Fri, Oct 7, 2022 at 12:14 AM Thiago Macieira via SG16 <
sg16_at_[hidden]> wrote:

> On Thursday, 6 October 2022 14:13:07 PDT Dimitrij Mijoski via SG16 wrote:
> > The functions will fall in four categories:
> >
> > 1. *Functions that decode/encode only one code point.* ICU has this
> > functionality in utf8.h
> > <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html>
> > and utf16.h
> > <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html>,
> > but it is implemented as macros that are not type-safe.
>
> I don't see how this can safely be done because UTF-8 and UTF-16 might need
> multiple code units to represent a single code point. Therefore, conversions
> are inherently a string operation.
>

I think there is a misunderstanding. The functions do work on strings of
code units. The UTF-8 decoding functions in group #1 accept a string of
code units, read 1 to 4 of them, and return a single code point. That is a
perfectly well-defined operation. The encoding functions do the opposite:
they take one code point and write 1 to 4 code units.
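
A minimal sketch of what a group #1 decoding function could look like. The
name decode_one_utf8, its signature, and the choice to return U+FFFD on
ill-formed input are assumptions made for illustration; they are not taken
from the proposal or from ICU.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical function, not part of any proposal or library.
// Precondition: i < s.size().
// Reads 1 to 4 code units starting at s[i], advances i past the units it
// consumed, and returns the decoded code point. Ill-formed input (bad lead
// unit, truncated or overlong sequence, surrogate, value above U+10FFFF)
// yields U+FFFD.
inline char32_t decode_one_utf8(std::string_view s, std::size_t& i) {
    constexpr char32_t replacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // one unit: ASCII

    int trail = 0;     // continuation units still expected
    char32_t cp = 0;   // payload bits collected so far
    char32_t min = 0;  // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0)      { trail = 1; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { trail = 2; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { trail = 3; cp = lead & 0x07; min = 0x10000; }
    else return replacement;  // stray continuation unit or invalid lead

    for (; trail > 0; --trail) {
        if (i == s.size()) return replacement;  // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return replacement;  // leave c unread
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return replacement;
    return cp;
}
```

A caller would typically drive it with a loop of the form
"while (i != s.size()) cp = decode_one_utf8(s, i);", so the string stays a
string of code units and only one code point is produced per call.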

>
> Unless there's a strong use-case for these, I'd simply forego them completely.
> But since this is what you've dedicated the majority of your email to, you may
> feel there is a need for them. Can you elaborate?
>

> Iterative conversion could probably be implemented with a stateful chunk API
> (your #4) where the input chunks are of size 1, though output chunks should
> probably never be less than 4 bytes (2 char16_t or 1 char32_t) long.
>

The functions in the four groups have different performance characteristics,
and for any given scenario one group performs better than the others. The
first group, which is my focus now, is the lowest level and adds no
overhead. When these functions are used as intended, as I have shown in the
examples, they do no unnecessary bounds checking: a string of N code units
is iterated with N+1 checks. That is why the functions in the higher groups
should be implemented on top of those in the lower groups. Group #4, the
API for decoding in chunks, keeps state between calls, so every call pays
for saving the execution state to a variable before returning and for
restoring it from that variable on entry. You can implement #1 with #4, but
it is suboptimal.
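
To make the cost of the saved state concrete, here is a rough sketch of a
stateful chunk decoder in the spirit of group #4. The type Utf8ChunkDecoder
and its members are hypothetical names, and the sketch assumes well-formed
UTF-8 so that the state handling stays visible; it is not the API under
discussion.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stateful decoder; assumes well-formed UTF-8 input and does
// no validation. The two members are the execution state that has to
// survive between calls when a code point is split across chunks.
struct Utf8ChunkDecoder {
    char32_t partial = 0;  // payload bits of an unfinished code point
    int pending = 0;       // continuation units still expected

    // Decodes one chunk and appends every completed code point to out.
    void feed(std::string_view chunk, std::u32string& out) {
        for (const char ch : chunk) {
            const unsigned char u = static_cast<unsigned char>(ch);
            if (pending == 0) {
                if (u < 0x80) { out.push_back(u); continue; }  // ASCII
                // Lead unit: work out the length and keep its payload bits.
                pending = u >= 0xF0 ? 3 : u >= 0xE0 ? 2 : 1;
                partial = u & (0x3F >> pending);
            } else {
                // Continuation unit: add six more payload bits.
                partial = (partial << 6) | (u & 0x3F);
                if (--pending == 0) out.push_back(partial);
            }
        }
    }
};
```

Feeding the chunks "\xE2\x82" and then "\xAC" produces U+20AC only after
the second call. Every code unit goes through the "pending" branch and the
state lives in memory between calls, which is the bookkeeping a group #1
function does not need.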

Let's put performance aside and look at usability. Here are two scenarios
where these functions come in very handy.

Problem 1: Given a UTF-8 encoded string in a std::string, replace every
second code point with a randomly generated code point, in place.
Problem 2: A more realistic example: given a UTF-8 encoded string, replace
or erase every code point that has a certain Unicode character property.

These functions are very useful if we want to directly manipulate UTF-8
encoded strings, without using a temporary UTF-32 string.
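
Problem 2 could be approached roughly as follows. This is only a sketch:
decode_valid_utf8 and erase_code_points_if are hypothetical names, the
input is assumed to be well-formed UTF-8, and is_combining_diacritic is a
stand-in range check (U+0300 to U+036F) rather than a real Unicode property
lookup, which would need a character database.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes one code point starting at s[i], advances i.
// Assumes well-formed UTF-8 and i < s.size(); no validation is done.
inline char32_t decode_valid_utf8(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;
    const int trail = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> trail);
    for (int k = 0; k < trail; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Stand-in predicate for "has a certain Unicode character property".
inline bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Erases, in place, every code point for which pred returns true.
// No temporary UTF-32 string is created; kept code units are moved left.
template <class Pred>
void erase_code_points_if(std::string& s, Pred pred) {
    std::size_t read = 0;   // next code unit to decode
    std::size_t write = 0;  // next position to store a kept code unit
    while (read != s.size()) {
        const std::size_t start = read;
        const char32_t cp = decode_valid_utf8(s, read);
        if (pred(cp)) continue;  // skip all units of this code point
        for (std::size_t k = start; k != read; ++k) s[write++] = s[k];
    }
    s.resize(write);
}
```

Since write never gets ahead of read, the copy is safe without a second
buffer. Replacing instead of erasing (as in Problem 1) would also need the
matching encode function, and the string may grow or shrink when the
replacement has a different length in code units.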

Received on 2022-10-07 20:01:38