Re: Term for "UTF-8, UTF-16 and UTF-32"

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 9 Feb 2023 02:56:51 +0100
On 09/02/2023 02.39, Robin Leroy via SG16 wrote:
> Dear Corentin,
> I think you want to refer to /the Unicode encoding forms/.

No, that's not the right term, because its definition is not closed.

> See, for instance:
> The Unicode Standard, Section 3.9, Unicode Encoding Forms <http://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G7404>:

This says

D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.

So, any rule that maps Unicode scalar values to a unique code unit
sequence is a Unicode encoding form. This certainly includes
UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.
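To illustrate, here is a minimal sketch (not taken from any paper or library; the names encode_utf8, encode_cesu8 and the helpers are made up for this example). Both functions assign each scalar value a unique sequence of 8-bit code units, so both fit the wording of D79. They differ only for scalar values above U+FFFF, where CESU-8, as described in Unicode Technical Report #26, encodes each half of the UTF-16 surrogate pair as three bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// A Unicode scalar value is any code point outside the surrogate range.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: truncate to one 8-bit code unit.
inline std::uint8_t byte(char32_t v) {
    return static_cast<std::uint8_t>(v & 0xFF);
}

// Hypothetical helper: append a value below 0x10000 using the
// 1-, 2- or 3-byte UTF-8 bit layout (no surrogate check here).
inline void append_up_to_3(Bytes& out, char32_t c) {
    if (c < 0x80) {
        out.push_back(byte(c));
    } else if (c < 0x800) {
        out.push_back(byte(0xC0 | (c >> 6)));
        out.push_back(byte(0x80 | (c & 0x3F)));
    } else {
        out.push_back(byte(0xE0 | (c >> 12)));
        out.push_back(byte(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (c & 0x3F)));
    }
}

// UTF-8: supplementary scalar values take a single 4-byte sequence.
inline Bytes encode_utf8(char32_t c) {
    assert(is_scalar_value(c));
    Bytes out;
    if (c < 0x10000) {
        append_up_to_3(out, c);
    } else {
        out.push_back(byte(0xF0 | (c >> 18)));
        out.push_back(byte(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(byte(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (c & 0x3F)));
    }
    return out;
}

// CESU-8: supplementary scalar values are split into a UTF-16
// surrogate pair, and each surrogate is written as 3 bytes.
inline Bytes encode_cesu8(char32_t c) {
    assert(is_scalar_value(c));
    Bytes out;
    if (c < 0x10000) {
        append_up_to_3(out, c);
    } else {
        const char32_t v = c - 0x10000;
        append_up_to_3(out, 0xD800 | (v >> 10));   // high surrogate
        append_up_to_3(out, 0xDC00 | (v & 0x3FF)); // low surrogate
    }
    return out;
}
```

For U+00E9 the two functions return the same two bytes; for U+1F600, encode_utf8 returns four bytes and encode_cesu8 returns six. Both mappings are unique per scalar value, which is why the bare definition does not single out the three UTFs.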

> The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8.

No disagreement here, but that's not the definition of
"Unicode encoding form".

> Unicode Technical Report #17, Unicode Character Encoding Model, Section 5 Character Encoding Scheme (CES): <https://www.unicode.org/reports/tr17/#CharacterEncodingScheme>
> Some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms.
> Note that /Unicode encodings specified in the Unicode standard/ is a little bit ambiguous, because Unicode distinguishes the encoding /forms/ (code points to code units) from the encoding /schemes/ (code units to bytes; the Unicode Standard supports seven encoding schemes, with LE/BE/BOM for 16 and 32). Assuming that the context here is [format.string.escaped] in document P2736, it looks like you are indeed dealing with the interpretation of code units (represented by the types char8_t, char16_t, and char32_t, per [lex.string.literal] referenced in [format.string.escaped]), and thus with encoding /forms/.

Yes. So, a valid description for the (closed) set UTF-8, UTF-16, UTF-32
would be

"The Unicode encoding forms specified in the Unicode standard"

but that's quite a mouthful, and longer than "UTF-8, UTF-16, UTF-32".
Then again, the shorter list is ambiguous as to encoding form vs. encoding scheme,
which is not great in itself.
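As a rough sketch of that form/scheme distinction (serialize_utf16 and ByteOrder are hypothetical names for this example only): the encoding form yields char16_t code units, and an encoding scheme additionally fixes how each code unit is laid out as bytes. The same code unit sequence therefore serializes differently under UTF-16BE and UTF-16LE.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical selector for the byte order of the encoding scheme.
enum class ByteOrder { big, little };

// Serialize UTF-16 code units (the encoding form) into bytes
// (the encoding scheme). No BOM is written in this sketch.
inline std::vector<std::uint8_t>
serialize_utf16(const std::u16string& units, ByteOrder order) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::big) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

The single code unit 0x00E9 becomes the bytes 00 E9 with ByteOrder::big and E9 00 with ByteOrder::little. The label "UTF-16" alone does not say which of these levels is meant.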


Received on 2023-02-09 01:56:57