sg16

Re: Proposal for low-level Unicode decoding and encoding

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 06 Oct 2022 15:14:23 -0700
On Thursday, 6 October 2022 14:13:07 PDT Dimitrij Mijoski via SG16 wrote:
> The functions will fall in four categories:
>
> 1. *Functions that decode/encode only one code point.* ICU has this
> functionality in utf8.h
> <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html>
> and utf16.h
> <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html>,
> but it is implemented as macros that are not type-safe.

I don't see how this can safely be done because UTF-8 and UTF-16 might need
multiple code units to represent a single code point. Therefore, conversions
are inherently a string operation.
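
To make the multiple-code-unit point concrete, here is a minimal sketch that counts how many code units one code point needs in each encoding. The function names are made up for illustration and do not come from the proposal or from ICU; the input is assumed to be a valid Unicode scalar value.

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only.
// Precondition (assumed, not checked): cp is a valid Unicode scalar value.

// Number of UTF-8 code units (bytes) needed for one code point: 1 to 4.
constexpr std::size_t utf8_code_units(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of UTF-16 code units needed for one code point: 1, or 2 for a
// surrogate pair (code points above the Basic Multilingual Plane).
constexpr std::size_t utf16_code_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

So a single-code-point interface cannot consume or produce a fixed number of code units; the caller has to deal with a variable-length run either way.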

Unless there's a strong use case for these, I'd simply forgo them completely.
But since this is what you've dedicated the majority of your email to, you may
feel there is a need for them. Can you elaborate?

Iterative conversion could probably be implemented with a stateful chunk API
(your #4) where the input chunks are of size 1, though output chunks should
probably never be less than 4 bytes (2 char16_t or 1 char32_t) long.
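
As a rough sketch of what I mean by a stateful API fed with input chunks of size 1, something along these lines would do for UTF-8 to UTF-32. Everything here (Utf8ChunkDecoder, feed, decode_bytewise) is a hypothetical name chosen for illustration, not an existing or proposed interface, and the error handling is deliberately simplistic.

```cpp
#include <optional>
#include <string>

// Hypothetical stateful decoder; not part of any proposal or library.
struct Utf8ChunkDecoder {
    char32_t partial = 0;    // bits accumulated so far
    int pending = 0;         // continuation bytes still expected
    char32_t min_value = 0;  // smallest value valid for this length (rejects overlong forms)

    // Feed one code unit. Returns a code point once one is complete,
    // U+FFFD on malformed input, or nullopt when more input is needed.
    std::optional<char32_t> feed(unsigned char byte) {
        constexpr char32_t replacement = 0xFFFD;
        if (pending == 0) {
            if (byte < 0x80) return static_cast<char32_t>(byte);
            if (byte >= 0xC2 && byte <= 0xDF) { partial = byte & 0x1F; pending = 1; min_value = 0x80; }
            else if (byte >= 0xE0 && byte <= 0xEF) { partial = byte & 0x0F; pending = 2; min_value = 0x800; }
            else if (byte >= 0xF0 && byte <= 0xF4) { partial = byte & 0x07; pending = 3; min_value = 0x10000; }
            else return replacement;  // stray continuation byte or invalid lead byte
            return std::nullopt;
        }
        if ((byte & 0xC0) != 0x80) {
            // Simplification: the offending byte is dropped, not re-examined as a new lead byte.
            pending = 0;
            return replacement;
        }
        partial = (partial << 6) | (byte & 0x3F);
        if (--pending > 0) return std::nullopt;
        const bool bad = partial < min_value || partial > 0x10FFFF ||
                         (partial >= 0xD800 && partial <= 0xDFFF);
        return bad ? replacement : partial;
    }
};

// Drives the decoder one byte at a time. A sequence truncated at the very
// end of the input is silently ignored in this sketch.
std::u32string decode_bytewise(const std::string& bytes) {
    Utf8ChunkDecoder decoder;
    std::u32string out;
    for (unsigned char b : bytes)
        if (auto cp = decoder.feed(b)) out.push_back(*cp);
    return out;
}
```

The state carried between calls is what lets the input be split anywhere, including in the middle of a multi-byte sequence. In the opposite direction, one code point can expand to 4 UTF-8 bytes or 2 char16_t, which is why the output side needs at least that much room per step.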

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DCAI Cloud Engineering

Received on 2022-10-06 22:14:25