Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Robin Leroy <egg.robin.leroy_at_[hidden]>
Date: Fri, 24 Feb 2023 10:45:06 +0100

Dear SG 16,

The Properties and Algorithms group has discussed this issue; it will
likely be proposing at UTC #175 (April 25–27) that a new definition *Unicode
standard encoding form* be added to mean the three UTFs, and that
references to *the [three] Unicode encoding forms* throughout The Unicode
Standard be fixed using that new definition.
If such a proposal is accepted by the UTC, the resulting change would make
it into a published standard no earlier than Unicode 16.0, September 2024
(this year’s 15.1 will be a more lightweight release, with no changes to
the core specification).

Best regards,

Robin Leroy

On Thu, 9 Feb 2023 at 04:03, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Wed, Feb 8, 2023 at 10:00 PM Robin Leroy via SG16 <
> sg16_at_[hidden]> wrote:
>
>>
>> On Thu, 9 Feb 2023 at 10:00, Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>> Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want
>>> to make sure to filter out conforming-but-not-specified-in-Unicode
>>> encodings.
>>>
>> CESU-8 is an encoding scheme; but I would have interpreted the language
>> in The Unicode Standard and in UTR #17 as meaning that the Unicode encoding
>> forms are only the three UTFs.
>>
>> Indeed, the standard, like #17 quoted earlier, repeatedly uses the
>> definite article with the term Unicode encoding forms, sometimes explicitly
>> with the number three (the Standard has 11 occurrences of *the Unicode
>> encoding forms*, and 8 occurrences of *the three Unicode encoding forms*).
>>
>> However, D79 quoted by Jens contradicts that usage.
>>
>> On Thu, 9 Feb 2023 at 09:56, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>>
>>>
>>> D79 A Unicode encoding form assigns each Unicode scalar value to a
>>> unique code unit sequence.
>>
>>
>> I have opened an issue for the Properties and Algorithms Group
>> <https://www.unicode.org/consortium/props-algorithms.html> (reporting to
>> the Unicode Technical Committee) to look into this.
>>
>
> Thank you. We look forward to harmonization between D79 and the usage in
> UTR #17.
>
> -- HT
>
>
>>
>>
>>> So, any rule that maps Unicode scalar values to a unique code point
>>> sequence is a Unicode encoding form. This certainly includes
>>> UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.
>>
>> (Aside, the above should say *to a unique code unit sequence*, and it is
>> not clear to me that CESU-8 should be seen as an encoding form with 8-bit
>> code units and a trivial encoding scheme, rather than an encoding scheme
>> for the UTF-16 encoding form; the title of UTR #26 suggests the latter.)
>
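
To make the distinction discussed in this thread concrete, here is a minimal C++ sketch that maps a single Unicode scalar value to its code unit sequence in UTF-8, UTF-16, and UTF-32, and to the byte sequence CESU-8 would produce. The function names (`to_utf8`, `to_utf16`, `to_utf32`, `to_cesu8`, `append_up_to_3_bytes`, `is_scalar_value`) are hypothetical and do not come from the thread or from any library. The CESU-8 layout is my reading of UTR #26: each UTF-16 code unit, surrogates included, is written with the one- to three-byte UTF-8 bit pattern. Writing `to_cesu8` on top of `to_utf16` is only a convenient way to express that and is not meant to settle whether CESU-8 is an encoding form or an encoding scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A Unicode scalar value is any code point except the surrogates
// U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// UTF-32: every scalar value is a single 32-bit code unit.
std::vector<char32_t> to_utf32(char32_t c) {
    assert(is_scalar_value(c));
    return {c};
}

// UTF-16: one 16-bit code unit for the BMP, a surrogate pair otherwise.
std::vector<char16_t> to_utf16(char32_t c) {
    assert(is_scalar_value(c));
    if (c < 0x10000) return {static_cast<char16_t>(c)};
    const char32_t v = c - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Hypothetical helper: append the 1- to 3-byte UTF-8 bit pattern for a
// value below 0x10000. It does not reject surrogates, because the CESU-8
// sketch below feeds it surrogate code units on purpose.
void append_up_to_3_bytes(std::vector<std::uint8_t>& out, char32_t v) {
    if (v < 0x80) {
        out.push_back(static_cast<std::uint8_t>(v));
    } else if (v < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (v >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (v >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((v >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    }
}

// UTF-8: one to four 8-bit code units per scalar value.
std::vector<std::uint8_t> to_utf8(char32_t c) {
    assert(is_scalar_value(c));
    std::vector<std::uint8_t> out;
    if (c < 0x10000) {
        append_up_to_3_bytes(out, c);
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (c >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    }
    return out;
}

// CESU-8, as read from UTR #26: encode each UTF-16 code unit separately,
// so a supplementary scalar value becomes six bytes instead of four.
std::vector<std::uint8_t> to_cesu8(char32_t c) {
    std::vector<std::uint8_t> out;
    for (char16_t unit : to_utf16(c)) append_up_to_3_bytes(out, unit);
    return out;
}
```

For a BMP scalar value such as U+00E9, `to_utf8` and `to_cesu8` return the same bytes. For a supplementary scalar value such as U+10400, `to_utf8` returns four bytes while `to_cesu8` returns six. Both functions still assign each scalar value a unique sequence, which is why a literal reading of D79 does not by itself exclude CESU-8.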

Received on 2023-02-24 09:45:23