sg16

Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 8 Feb 2023 17:59:51 -0800
Thank you for your quick reply.
Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want to
make sure to filter out conforming-but-not-specified-in-Unicode encodings.

"Supports" in the wording you quoted is somewhat ambiguous: it could
arguably mean either "admits the existence of" or "these are the encodings
in the standard but there may be others", so we weren't quite sure.

Thanks!

On Wed, Feb 8, 2023, 17:39 Robin Leroy <egg.robin.leroy_at_[hidden]> wrote:

> Dear Corentin,
>
> I think you want to refer to *the Unicode encoding forms*.
> See, for instance:
> The Unicode Standard, Section 3.9, Unicode Encoding Forms
> <http://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G7404>:
>
>> The Unicode Standard supports three character encoding forms: UTF-32,
>> UTF-16, and UTF-8.
>
> Unicode Technical Report #17, Unicode Character Encoding Model, Section
> 5 Character Encoding Scheme (CES):
> <https://www.unicode.org/reports/tr17/#CharacterEncodingScheme>
>
>> Some of the Unicode encoding schemes have the same labels as the three
>> Unicode encoding forms.
>
>
> Note that *Unicode encodings specified in the Unicode standard* is a
> little bit ambiguous, because Unicode distinguishes the encoding *forms* (code
> points to code units) from the encoding *schemes* (code units to bytes;
> the Unicode Standard supports seven encoding schemes, with LE/BE/BOM for 16
> and 32). Assuming that the context here is [format.string.escaped] in
> document P2736, it looks like you are indeed dealing with the
> interpretation of code units (represented by the types char8_t, char16_t,
> and char32_t, per [lex.string.literal] referenced in
> [format.string.escaped]), and thus with encoding *forms*.
>
> Best regards,
>
> Robin Leroy
>
> On Wed, Feb 8, 2023 at 00:32, Corentin <corentin.jabot_at_[hidden]> wrote:
>
>> Hey Robin,
>> How are you?
>>
>> Does Unicode have a term to designate "UTF-8, UTF-16 and UTF-32", i.e.
>> Unicode encodings specified in the Unicode standard - excluding things like
>> CESU-8 for example?
>> It's something we would find useful in the C++ specification.
>>
>> Thanks,
>>
>> Corentin
>>
>>
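
The distinction Robin draws between encoding forms (code points to code
units) and encoding schemes (code units to bytes) can be illustrated with a
minimal C++ sketch. The names to_utf16, to_bytes, and ByteOrder are
hypothetical helpers invented for this illustration; they are not part of the
C++ standard library or of P2736. Only UTF-16 is shown, and the BOM-carrying
"UTF-16" scheme is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encoding FORM (UTF-16): maps one code point to one or two 16-bit code units.
// Hypothetical helper; the input must be a Unicode scalar value.
std::u16string to_utf16(char32_t cp) {
    // Precondition: not a surrogate code point and not above U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string units;
    if (cp < 0x10000) {
        // BMP code point: a single code unit.
        units.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: a surrogate pair.
        const char32_t v = cp - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return units;
}

// Hypothetical selector for the byte order of the encoding scheme.
enum class ByteOrder { Big, Little };

// Encoding SCHEME (UTF-16BE or UTF-16LE): serializes code units to bytes.
std::vector<std::uint8_t> to_bytes(const std::u16string& units, ByteOrder order) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

For U+1F600, to_utf16 yields the same two code units (0xD83D, 0xDE00)
regardless of byte order. Only to_bytes depends on the scheme: D8 3D DE 00
for UTF-16BE and 3D D8 00 DE for UTF-16LE. Types such as char16_t hold code
units, which is why the wording under discussion concerns encoding forms.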

Received on 2023-02-09 02:00:06