Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 24 Feb 2023 11:41:39 +0100
Thank you Robin for this! It was rather quick!
I like the resolution, and that schedule would give us time to update
the specification in 2024/2025, before the next C++ version.

Corentin

On Fri, Feb 24, 2023 at 10:45 AM Robin Leroy <egg.robin.leroy_at_[hidden]>
wrote:

> Dear SG 16,
>
> The Properties and Algorithms group has discussed this issue; it will
> likely be proposing at UTC #175 (April 25–27) that a new definition *Unicode
> standard encoding form* be added to mean the three UTFs, and that
> references to *the [three] Unicode encoding forms* throughout The Unicode
> Standard be fixed using that new definition.
> If such a proposal is accepted by the UTC, the resulting change would make
> it into a published standard no earlier than Unicode 16.0, September 2024
> (this year’s 15.1 will be a more lightweight release, with no changes to
> the core specification).
>
> Best regards,
>
> Robin Leroy
>
> On Thu, Feb 9, 2023 at 4:03 AM, Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Wed, Feb 8, 2023 at 10:00 PM Robin Leroy via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>>
>>> On Thu, Feb 9, 2023 at 10:00 AM, Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>> Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want
>>>> to make sure to filter out conforming-but-not-specified-in-Unicode
>>>> encodings.
>>>>
>>> CESU-8 is an encoding scheme; but I would have interpreted the language
>>> in The Unicode Standard and in UTR #17 as meaning that the Unicode encoding
>>> forms are only the three UTFs.
>>>
>>> Indeed, the standard, like #17 quoted earlier, repeatedly uses the
>>> definite article with the term Unicode encoding forms, sometimes explicitly
>>> with the number three (the Standard has 11 occurrences of *the Unicode
>>> encoding forms*, and 8 occurrences of *the three Unicode encoding forms*).
>>>
>>> However, D79 quoted by Jens contradicts that usage.
>>>
>>> On Thu, Feb 9, 2023 at 9:56 AM, Jens Maurer <jens.maurer_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>> D79 A Unicode encoding form assigns each Unicode scalar value to a
>>>> unique code unit sequence.
>>>
>>>
>>> I have opened an issue for the Properties and Algorithms Group
>>> <https://www.unicode.org/consortium/props-algorithms.html> (reporting
>>> to the Unicode Technical Committee) to look into this.
>>>
>>
>> Thank you. We look forward to harmonization between D79 and the usage in
>> UTR #17.
>>
>> -- HT
>>
>>
>>>
>>>
>>>> So, any rule that maps Unicode scalar values to a unique code point
>>>> sequence is a Unicode encoding form. This certainly includes
>>>> UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.
>>>
>>> (Aside, the above should say *to a unique code unit sequence*, and it
>>> is not clear to me that CESU-8 should be seen as an encoding form with
>>> 8-bit code units and a trivial encoding scheme, rather than an encoding
>>> scheme for the UTF-16 encoding form; the title of UTR #26 suggests the
>>> latter.)
>>>
>>
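
To make the distinction discussed in this thread concrete, here is a minimal C++ sketch of how one Unicode scalar value maps to a code unit sequence under UTF-32, UTF-16, and UTF-8, and how CESU-8 differs by re-encoding the UTF-16 code units as bytes. All function names (is_scalar_value, encode_utf32, encode_utf16, append_bmp_bytes, encode_utf8, encode_cesu8) are hypothetical and chosen for illustration only; they do not come from the thread, from The Unicode Standard, or from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; not part of any standard API.

// A Unicode scalar value is any code point except the surrogates U+D800..U+DFFF.
bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// UTF-32: exactly one 32-bit code unit per scalar value.
std::vector<char32_t> encode_utf32(char32_t c) {
    assert(is_scalar_value(c));
    return {c};
}

// UTF-16: one 16-bit code unit for the BMP, a surrogate pair otherwise.
std::vector<char16_t> encode_utf16(char32_t c) {
    assert(is_scalar_value(c));
    if (c < 0x10000) return {static_cast<char16_t>(c)};
    const char32_t v = c - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Appends the 1- to 3-byte UTF-8-style form of a value below 0x10000.
// Used for BMP scalar values and, in CESU-8 only, for surrogate code units.
void append_bmp_bytes(std::vector<std::uint8_t>& out, char32_t v) {
    assert(v < 0x10000);
    if (v < 0x80) {
        out.push_back(static_cast<std::uint8_t>(v));
    } else if (v < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (v >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (v >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((v >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    }
}

// UTF-8: 1 to 4 bytes, computed directly from the scalar value.
std::vector<std::uint8_t> encode_utf8(char32_t c) {
    assert(is_scalar_value(c));
    std::vector<std::uint8_t> out;
    if (c < 0x10000) {
        append_bmp_bytes(out, c);
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (c >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    }
    return out;
}

// CESU-8 (UTR #26): each UTF-16 code unit, surrogates included, is written
// as a 1- to 3-byte sequence, so supplementary characters take 6 bytes.
std::vector<std::uint8_t> encode_cesu8(char32_t c) {
    std::vector<std::uint8_t> out;
    for (char16_t unit : encode_utf16(c)) append_bmp_bytes(out, unit);
    return out;
}
```

For U+1F600, this sketch yields F0 9F 98 80 in UTF-8 and ED A0 BD ED B8 80 in CESU-8; for scalar values in the BMP the two agree. Because encode_cesu8 is written here as a byte serialization layered on encode_utf16, the sketch reads CESU-8 as an encoding scheme for the UTF-16 encoding form. Since it also maps each scalar value to a unique sequence of 8-bit units, it could just as well be viewed as an encoding form under the wording of D79, which is the ambiguity raised above.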

Received on 2023-02-24 10:41:52