Dear SG 16,

The Properties and Algorithms Group has discussed this issue; it will likely propose at UTC #175 (April 25–27) that a new definition, "Unicode standard encoding form", be added to mean the three UTFs, and that references to "the [three] Unicode encoding forms" throughout The Unicode Standard be fixed to use that new definition.

If such a proposal is accepted by the UTC, the resulting change would make it into a published standard no earlier than Unicode 16.0, September 2024 (this year’s 15.1 will be a more lightweight release, with no changes to the core specification).

Best regards,

Robin Leroy

On Thu, Feb 9, 2023 at 04:03, Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:

On Wed, Feb 8, 2023 at 10:00 PM Robin Leroy via SG16 <sg16@lists.isocpp.org> wrote:

On Thu, Feb 9, 2023 at 10:00, Corentin <corentin.jabot@gmail.com> wrote:
Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want to make sure to filter out conforming-but-not-specified-in-Unicode encodings.
CESU-8 is an encoding scheme; but I would have interpreted the language in The Unicode Standard and in UTR #17 as meaning that the Unicode encoding forms are only the three UTFs.

Indeed, the Standard, like UTR #17 quoted earlier, repeatedly uses the definite article with the term "Unicode encoding forms", sometimes explicitly with the number three (the Standard has 11 occurrences of "the Unicode encoding forms", and 8 occurrences of "the three Unicode encoding forms").

However, D79 quoted by Jens contradicts that usage.

On Thu, Feb 9, 2023 at 09:56, Jens Maurer <jens.maurer@gmx.net> wrote:

D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.

I have opened an issue for the Properties and Algorithms Group (reporting to the Unicode Technical Committee) to look into this.

Thank you. We look forward to harmonization between D79 and the usage in UTR #17.

-- HT

So, any rule that maps Unicode scalar values to a unique code point
sequence is a Unicode encoding form. This certainly includes
UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.
(As an aside, the above should say "to a unique code unit sequence", and it is not clear to me that CESU-8 should be seen as an encoding form with 8-bit code units and a trivial encoding scheme, rather than as an encoding scheme for the UTF-16 encoding form; the title of UTR #26 suggests the latter.)
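
To make the CESU-8 point concrete, here is a minimal Python sketch. The function names utf8, cesu8, and _three_byte are purely illustrative and do not come from any standard or library. It assumes the CESU-8 definition in UTR #26: scalar values in the BMP are encoded exactly as in UTF-8, while supplementary ones are first split into a UTF-16 surrogate pair and each surrogate is then written as a three-byte sequence.

```python
def _three_byte(unit: int) -> bytes:
    """Encode a 16-bit value in the range 0x0800..0xFFFF as a three-byte sequence."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def utf8(scalar: int) -> bytes:
    """UTF-8 code unit sequence for a Unicode scalar value."""
    # Python's codec rejects surrogates, so only scalar values are accepted.
    return chr(scalar).encode("utf-8")


def cesu8(scalar: int) -> bytes:
    """CESU-8 code unit sequence for a Unicode scalar value (sketch per UTR #26)."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        # BMP: identical to UTF-8.
        return utf8(scalar)
    # Supplementary: split into a UTF-16 surrogate pair, then encode each half.
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return _three_byte(high) + _three_byte(low)


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10400):
        print(f"U+{cp:04X}  UTF-8: {utf8(cp).hex(' ')}  CESU-8: {cesu8(cp).hex(' ')}")
```

Both functions map distinct scalar values to distinct byte sequences, so under D79 as literally worded both would seem to qualify as a Unicode encoding form, even though they disagree on every supplementary code point.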
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16