Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Robin Leroy <egg.robin.leroy_at_[hidden]>
Date: Thu, 9 Feb 2023 10:59:58 +0800
On Thu, 9 Feb 2023 at 10:00, Corentin <corentin.jabot_at_[hidden]> wrote:

> Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want
> to make sure to filter out conforming-but-not-specified-in-Unicode encodings.
CESU-8 is an encoding scheme; but I would have interpreted the language in
The Unicode Standard and in UTR #17 as meaning that the Unicode encoding
forms are only the three UTFs.

Indeed, the Standard, like UTR #17 quoted earlier, repeatedly uses the definite
article with the term Unicode encoding forms, sometimes explicitly with the
number three (the Standard has 11 occurrences of *the Unicode encoding
forms*, and 8 occurrences of *the three Unicode encoding forms*).

However, D79 quoted by Jens contradicts that usage.

On Thu, 9 Feb 2023 at 09:56, Jens Maurer <jens.maurer_at_[hidden]> wrote:

> D79 A Unicode encoding form assigns each Unicode scalar value to a unique
> code unit sequence.

I have opened an issue for the Properties and Algorithms Group
<https://www.unicode.org/consortium/props-algorithms.html> (reporting to
the Unicode Technical Committee) to look into this.

> So, any rule that maps Unicode scalar values to a unique code point
> sequence is a Unicode encoding form. This certainly includes
> UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.

(As an aside, the above should say *to a unique code unit sequence*, and it is
not clear to me that CESU-8 should be seen as an encoding form with 8-bit
code units and a trivial encoding scheme, rather than an encoding scheme
for the UTF-16 encoding form; the title of UTR #26 suggests the latter.)
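
To make the difference concrete, here is a minimal sketch of how a
supplementary scalar value (U+10000..U+10FFFF) ends up as bytes in UTF-8 and
in CESU-8. The function names and the `Bytes` alias are hypothetical, chosen
for illustration only; the sketch assumes its input is already in the
supplementary range and does no validation. It follows the second reading
above: CESU-8 is computed by first forming the UTF-16 surrogate pair and then
serializing each 16-bit code unit as three bytes.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;  // hypothetical alias for this sketch

// Appends the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx for a 16-bit value.
void append_three_bytes(Bytes& out, std::uint32_t v) {
    out.push_back(static_cast<std::uint8_t>(0xE0 | (v >> 12)));
    out.push_back(static_cast<std::uint8_t>(0x80 | ((v >> 6) & 0x3F)));
    out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
}

// UTF-8: a supplementary scalar value maps directly to four 8-bit code units.
// Assumes 0x10000 <= sv <= 0x10FFFF (not checked).
Bytes utf8_supplementary(std::uint32_t sv) {
    return {
        static_cast<std::uint8_t>(0xF0 | (sv >> 18)),
        static_cast<std::uint8_t>(0x80 | ((sv >> 12) & 0x3F)),
        static_cast<std::uint8_t>(0x80 | ((sv >> 6) & 0x3F)),
        static_cast<std::uint8_t>(0x80 | (sv & 0x3F)),
    };
}

// CESU-8: go through the UTF-16 encoding form first, then serialize each
// surrogate code unit as three bytes, giving six bytes in total.
// Assumes 0x10000 <= sv <= 0x10FFFF (not checked).
Bytes cesu8_supplementary(std::uint32_t sv) {
    const std::uint32_t v = sv - 0x10000;
    const std::uint32_t high = 0xD800 | (v >> 10);    // high (lead) surrogate
    const std::uint32_t low = 0xDC00 | (v & 0x3FF);   // low (trail) surrogate
    Bytes out;
    append_three_bytes(out, high);
    append_three_bytes(out, low);
    return out;
}
```

For U+10400, `utf8_supplementary(0x10400)` yields F0 90 90 80, while
`cesu8_supplementary(0x10400)` yields ED A0 81 ED B0 80, the surrogate pair
D801 DC00 written out three bytes per code unit.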

Received on 2023-02-09 03:00:16