Date: Thu, 9 Feb 2023 02:56:51 +0100
On 09/02/2023 02.39, Robin Leroy via SG16 wrote:
> Dear Corentin,
>
> I think you want to refer to /the Unicode encoding forms/.
No, that's not the right term, because its definition is not closed.
> See, for instance:
> The Unicode Standard, Section 3.9, Unicode Encoding Forms <http://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G7404>:
This says
D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit
sequence.
So, any rule that maps Unicode scalar values to a unique code unit
sequence is a Unicode encoding form. This certainly includes
UTF-8, UTF-16, and UTF-32, but it also includes CESU-8.
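To make that concrete, here is a minimal Python sketch that encodes the same scalar value as UTF-8 and as CESU-8 (as described in Unicode Technical Report #26). The function names utf8_units and cesu8_units are made up for this illustration. Both rules give each scalar value exactly one code unit sequence, which is all that D79 asks for.

```python
def check_scalar(scalar: int) -> None:
    """Reject anything that is not a Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {scalar:#x}")


def utf8_units(scalar: int) -> bytes:
    """UTF-8 code units (8-bit) for one scalar value."""
    check_scalar(scalar)
    return chr(scalar).encode("utf-8")


def cesu8_units(scalar: int) -> bytes:
    """CESU-8 code units (8-bit) for one scalar value.

    BMP scalar values are encoded exactly as in UTF-8. Supplementary
    scalar values are first split into a UTF-16 surrogate pair, and each
    surrogate is then encoded with the three-byte UTF-8 pattern.
    """
    check_scalar(scalar)
    if scalar < 0x10000:
        return utf8_units(scalar)
    offset = scalar - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    # "surrogatepass" lets Python apply the UTF-8 bit pattern to a lone
    # surrogate, which strict UTF-8 would otherwise reject.
    return b"".join(
        chr(s).encode("utf-8", "surrogatepass") for s in (high, low)
    )


if __name__ == "__main__":
    for scalar in (0x20AC, 0x10400):
        print(f"U+{scalar:04X}",
              "UTF-8:", utf8_units(scalar).hex(" "),
              "CESU-8:", cesu8_units(scalar).hex(" "))
```

The two rules agree on the BMP and differ only for supplementary scalar values, so both fit the wording of D79 even though only one of them is among the three forms the Unicode Standard supports.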
> The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8.
No disagreement here, but that's not the definition of
"Unicode encoding form".
> Unicode Technical Report #17, Unicode Character Encoding Model, Section 5 Character Encoding Scheme (CES): <https://www.unicode.org/reports/tr17/#CharacterEncodingScheme>
>
> Some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms.
>
>
> Note that /Unicode encodings specified in the Unicode standard/ is a little bit ambiguous, because Unicode distinguishes the encoding /forms/ (code points to code units) from the encoding /schemes/ (code units to bytes; the Unicode Standard supports seven encoding schemes, with LE/BE/BOM for 16 and 32). Assuming that the context here is [format.string.escaped] in document P2736, it looks like you are indeed dealing with the interpretation of code units (represented by the types char8_t, char16_t, and char32_t, per [lex.string.literal] referenced in [format.string.escaped]), and thus with encoding /forms/.
Yes. So, a valid description for the (closed) set UTF-8, UTF-16, UTF-32
would be
"The Unicode encoding forms specified in the Unicode standard"
but that's actually quite a mouthful and longer than "UTF-8, UTF-16, UTF-32".
Except that the latter list is ambiguous regarding encoding form/encoding scheme,
which is not great in itself.
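As an illustration of the form/scheme ambiguity, here is a small Python sketch. The helper names utf16_code_units and serialize are invented for this example. The encoding form yields 16-bit code units; an encoding scheme additionally fixes how those code units are laid out as bytes. The bare label "UTF-16" does not say which of the two is meant.

```python
def utf16_code_units(scalar: int) -> list[int]:
    """UTF-16 encoding form: scalar value -> list of 16-bit code units."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {scalar:#x}")
    if scalar < 0x10000:
        return [scalar]
    offset = scalar - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def serialize(units: list[int], byteorder: str) -> bytes:
    """Encoding scheme step: 16-bit code units -> bytes.

    byteorder is "big" (as in UTF-16BE) or "little" (as in UTF-16LE).
    """
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


if __name__ == "__main__":
    units = utf16_code_units(0x10400)
    print("code units:", [f"{u:04X}" for u in units])
    print("UTF-16BE bytes:", serialize(units, "big").hex(" "))
    print("UTF-16LE bytes:", serialize(units, "little").hex(" "))
```

The code unit sequence is the same in both cases; only the byte serialization differs. char16_t in C++ holds the code units, so wording for [format.string.escaped] is about the form, not the scheme.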
Jens
Received on 2023-02-09 01:56:57