Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Steven R. Loomis <srl295_at_[hidden]>
Date: Mon, 24 Apr 2023 08:56:30 -0500

> ISO/IEC 10646 and the Unicode Standard define the UTF-8 encoding form, which is very similar in definition to CESU-8 other than its treatment of supplementary characters. CESU-8 is a different encoding scheme. It does not form part of either ISO/IEC 10646 or the Unicode Standard. It is intended only for use in compatibility situations where binary collation with UTF-16 is required.

From this I’d think that CESU-8 is not a Unicode Encoding Standard.

Steven R. Loomis
Code Hive Tx, LLC
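
To make the difference in "treatment of supplementary characters" concrete, here is a minimal C++ sketch. It is an illustration only and comes from neither the Unicode Standard nor the CESU-8 report; the function names (`append_up_to_3_bytes`, `encode_utf8`, `encode_cesu8`) are hypothetical, and no input validation is attempted. It assumes the usual description of CESU-8: a supplementary code point is first split into a UTF-16 surrogate pair, and each surrogate is then written with the 3-byte pattern.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a value <= 0xFFFF using the 1- to 3-byte patterns shared by
// UTF-8 and CESU-8. Illustration only: nothing is validated here, and a
// conforming UTF-8 encoder must never be handed a surrogate value.
void append_up_to_3_bytes(Bytes& out, std::uint32_t v) {
    if (v < 0x80) {
        out.push_back(static_cast<std::uint8_t>(v));
    } else if (v < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (v >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (v >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((v >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x3F)));
    }
}

// UTF-8: a supplementary code point becomes one 4-byte sequence.
Bytes encode_utf8(std::uint32_t cp) {
    Bytes out;
    if (cp < 0x10000) {
        append_up_to_3_bytes(out, cp);
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// CESU-8 (as assumed above): a supplementary code point becomes a UTF-16
// surrogate pair, and each surrogate is written as 3 bytes, 6 in total.
Bytes encode_cesu8(std::uint32_t cp) {
    Bytes out;
    if (cp < 0x10000) {
        append_up_to_3_bytes(out, cp);
    } else {
        const std::uint32_t v = cp - 0x10000;
        append_up_to_3_bytes(out, 0xD800 + (v >> 10));   // high surrogate
        append_up_to_3_bytes(out, 0xDC00 + (v & 0x3FF)); // low surrogate
    }
    return out;
}
```

For U+10400 this gives F0 90 90 80 in UTF-8 and ED A0 81 ED B0 80 in CESU-8. Comparing that with U+FF5E (EF BD 9E in both) shows the collation point: the CESU-8 bytes sort U+10400 before U+FF5E, matching UTF-16 code unit order, while the UTF-8 bytes sort it after.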
> On Feb 8, 2023, at 7:59 PM, Corentin via SG16 <sg16_at_[hidden]> wrote:
> Thank you for your quick reply.
> Does that mean that CESU-8 is not "a Unicode encoding form"? I.e., we want to make sure to filter out conforming-but-not-specified-in-Unicode encodings.
> "supports" in the wording you quoted is somewhat ambiguous; it could arguably mean either "admits the existence of" or "these are the encodings in the standard but there may be others", so we weren't quite sure.
> Thanks!
> On Wed, Feb 8, 2023, 17:39 Robin Leroy <egg.robin.leroy_at_[hidden]> wrote:
> Dear Corentin,
> I think you want to refer to the Unicode encoding forms.
> See, for instance:
> The Unicode Standard, Section 3.9, Unicode Encoding Forms:
> The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8.
> Unicode Technical Report #17, Unicode Character Encoding Model, Section 5 Character Encoding Scheme (CES):
> Some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms. 
> Note that "Unicode encodings specified in the Unicode standard" is a little bit ambiguous, because Unicode distinguishes the encoding forms (code points to code units) from the encoding schemes (code units to bytes; the Unicode Standard supports seven encoding schemes, with LE/BE/BOM for 16 and 32). Assuming that the context here is [format.string.escaped] in document P2736, it looks like you are indeed dealing with the interpretation of code units (represented by the types char8_t, char16_t, and char32_t, per [lex.string.literal] referenced in [format.string.escaped]), and thus with encoding forms.
> Best regards,
> Robin Leroy
> On Wed, Feb 8, 2023 at 00:32, Corentin <corentin.jabot_at_[hidden]> wrote:
> Hey Robin,
> How are you?
> Does Unicode have a term to designate "UTF-8, UTF-16 and UTF-32", i.e. Unicode encodings specified in the Unicode standard - excluding things like CESU-8 for example?
> It's something we would find useful in the C++ specification
> Thanks,
> Corentin
> -- 
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
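
The distinction drawn in the thread between encoding forms (code points to code units) and encoding schemes (code units to bytes) can be sketched in C++ as two separate steps. This is only an illustration under stated assumptions: `to_utf16_units`, `serialize_utf16`, and `ByteOrder` are hypothetical names, the input is assumed to be a valid Unicode scalar value, and BOM handling is left out.

```cpp
#include <cstdint>
#include <vector>

// Encoding form step: one code point -> one or two UTF-16 code units.
// Assumes cp is a valid scalar value (not a surrogate, <= 0x10FFFF).
std::vector<char16_t> to_utf16_units(char32_t cp) {
    if (cp < 0x10000) {
        return { static_cast<char16_t>(cp) };
    }
    const char32_t v = cp - 0x10000;
    return { static_cast<char16_t>(0xD800 + (v >> 10)),    // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate
}

enum class ByteOrder { Big, Little };

// Encoding scheme step: code units -> bytes in a chosen byte order,
// in the manner of UTF-16BE or UTF-16LE (no BOM is written here).
std::vector<std::uint8_t> serialize_utf16(const std::vector<char16_t>& units,
                                          ByteOrder order) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

For U+1F600 the first step yields the code units D83D DE00; the second step then produces D8 3D DE 00 for big-endian and 3D D8 00 DE for little-endian. Code that works with `char16_t` values only ever sees the result of the first step, which is why the thread settles on "encoding forms" as the relevant term.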

Received on 2023-04-24 13:56:43