Date: Fri, 15 Oct 2021 22:29:20 +0200
On 15/10/2021 21.13, Tom Honermann wrote:
> The following is my attempt to describe the concerns we're trying to balance here.
>
> First, some informal terminology. For the following, consider the text "ð𑄣" consisting of two characters denoted by the Unicode scalar values U+00F0 and U+11123 respectively.
>
> * Encoding Form: An encoding of a sequence of characters as a sequence of code points. In UTF-16, the above text is encoded as the sequence of 3 16-bit code points { { 0x00F0 }, { 0xD804, 0xDD23 } }.
This understanding of "encoding form" does not match the ISO 10646 definition (clause 10).
Quote:
"This document provides three encoding forms expressing each UCS scalar value
in a unique sequence of one or more code units. These are named UTF-8, UTF-16,
and UTF-32 respectively."
Thus, an encoding form maps a UCS scalar value to a sequence of code units.
You are incorrectly stating that an encoding form produces code points as
output.
(A code point is approximately the same as a UCS scalar value, which is
the input (not the output) of the "encoding form" mapping.)
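To make the distinction concrete, here is a small sketch (just for illustration; the helper name is mine): the encoding-form mapping takes a UCS scalar value to one or more 16-bit code units.

    #include <cstdint>
    #include <vector>

    // Encoding form: map a UCS scalar value to UTF-16 code units (not octets).
    std::vector<std::uint16_t> utf16_code_units(char32_t scalar)
    {
        std::vector<std::uint16_t> units;
        if (scalar < 0x10000) {
            units.push_back(static_cast<std::uint16_t>(scalar));
        } else {
            char32_t v = scalar - 0x10000;
            units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
        return units;
    }
    // utf16_code_units(U'\U00011123') yields { 0xD804, 0xDD23 }.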
> * Encoding Scheme: An encoding of a sequence of characters as a sequence of endian dependent code units.
> o In UTF-16BE, the above text is encoded as the sequence of 6 8-bit code units { { 0x00, 0xF0 }, { 0xD8, 0x04, 0xDD, 0x23 } }.
> o In UTF-16LE, the above text is encoded as the sequence of 6 8-bit code units { { 0xF0, 0x00 }, { 0x04, 0xD8, 0x23, 0xDD } }.
The use of "code units" is confused here.
Quote from ISO 10646 clause 11:
"Encoding schemes are octet serializations specific to each UCS encoding form, ..."
So, the encoding scheme adds an octet serialization on top of an encoding form.
The output of an encoding scheme is thus a sequence of octets.
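Again just as an illustration (my sketch): the encoding scheme then serializes each code unit into octets in a fixed byte order.

    #include <cstdint>
    #include <vector>

    // Encoding scheme: serialize UTF-16 code units into octets.
    std::vector<std::uint8_t> utf16_octets(const std::vector<std::uint16_t>& units,
                                           bool big_endian)
    {
        std::vector<std::uint8_t> octets;
        for (std::uint16_t u : units) {
            std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
            std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
            if (big_endian) { octets.push_back(hi); octets.push_back(lo); } // UTF-16BE
            else            { octets.push_back(lo); octets.push_back(hi); } // UTF-16LE
        }
        return octets;
    }
    // { 0xD804, 0xDD23 } serializes to { 0xD8, 0x04, 0xDD, 0x23 } (BE)
    // or { 0x04, 0xD8, 0x23, 0xDD } (LE), matching the sequences quoted above.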
> * Encoding forms and encoding schemes are related; given encoding X, the encoding scheme of X is an encoding of the sequence of code points of the encoding form of X into a sequence of code units.
>
> Next, some assertions that I expect to be uncontroversial.
>
> * Bytes are not octets; they are >= 8 bits.
> * The number of bits in a byte is implementation-defined and exposed via the CHAR_BIT macro.
> * sizeof(char) is always 1 and therefore always 1 byte.
> * sizeof(wchar_t) is >= 1 and therefore 1 or more bytes.
> * Both Unicode and IANA restrict encoding schemes to
The last bullet appears to be truncated.
> Implementations with the following implementation-defined characteristics are common:
>
> * CHAR_BIT=8 sizeof(wchar_t) == 2
> * CHAR_BIT=8 sizeof(wchar_t) == 4
>
> Implementations with the following implementation-defined characteristics are not common, but are known to exist. I don't know what encodings are used for character and string literals for these cases.
>
> * CHAR_BIT=16 sizeof(wchar_t) == 1 (Texas Instruments cl54 C++ compiler)
> * CHAR_BIT=16 sizeof(wchar_t) == 2 (CEVA C++ compiler)
> * CHAR_BIT=32 sizeof(wchar_t) == 1 (Analog Devices C++ compiler)
>
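(For what it's worth, both quantities can be inspected directly; a trivial sketch:)

    #include <climits>
    #include <cstdio>

    int main()
    {
        // Both values are implementation-defined; the combinations listed
        // above are simply the ones commonly shipped.
        std::printf("CHAR_BIT = %d, sizeof(wchar_t) = %zu\n",
                    CHAR_BIT, sizeof(wchar_t));
    }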
> There are arguably five variants of each of UTF-16, UTF-32, UCS-2, and UCS-4:
>
> 1. The encoding form that produces a sequence of 16-bit code points (UTF-16, UCS-2) or 32-bit code points (UTF-32, UCS-4).
> 2. The big-endian encoding scheme that produces a sequence of 8-bit code units.
> 3. The little-endian encoding scheme that produces a sequence of 8-bit code units.
> 4. The native-endian encoding scheme in which the endianness is specified as either big or little depending on platform.
This one does not exist in ISO 10646.
> 5. The encoding scheme in which the endianness is determined by a leading BOM character (with a default otherwise; usually big-endian).
>
> IANA does not provide encoding identifiers that enable distinguishing between the five variants listed above.
IANA defines encoding schemes, not encoding forms, so #1 is out of scope for
IANA. The other three variants defined by Unicode (#2, #3, and #5) are actually
represented for both UTF-16 and UTF-32 (but not for UCS-2 and UCS-4).
> So, we have to decide how to map the IANA encoding identifiers to our intended uses. IANA intends to identify encoding schemes and provides the following identifiers for the encodings mentioned above. The numbers correspond to the numbered variants above.
>
> * UTF-16BE (#2)
> * UTF-16LE (#3)
> * UTF-16 (#5)
> * UTF-32BE (#2)
> * UTF-32LE (#3)
> * UTF-32 (#5)
> * ISO-10646-UCS-2 (#2, #3, #5; endianness is effectively unspecified)
> * ISO-10646-UCS-4 (#2, #3, #5; endianness is effectively unspecified)
>
> Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as being obsolete.
Does that mean we should simply exclude these cases from the set of possible
return values for the _literal() functions? If we don't, we should give
guidance on what implementations on such platforms should do.
> The text_encoding type provided by P1885 is intended to serve multiple use cases. Examples include discovering how literals are encoded, associating an encoding with a file or a network stream, and communicating an encoding to a conversion facility such as iconv(). In the case of string literals, there is an inherent conflict over whether an encoding form or encoding scheme is desired.
>
> Consider an implementation where sizeof(wchar_t) == 2 and wide literals are encoded in a UTF-16 encoding scheme. The elements of a wide literal string are 16-bit code points encoded in either big-endian or little-endian order across 2 bytes. It would therefore make sense for wide_literal() to return either UTF-16BE or UTF-16LE. However, programmers usually interact with wide strings at the encoding form level, so they may expect UTF-16 with an interpretation matching variant #1 or #4 above.
>
> Now consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 16, and wide literals are encoded in the UTF-16 encoding form. In this case, none of the encoding schemes apply. Programmers are likely to expect wide_literal() to return UTF-16 with an interpretation matching variant #1 above.
>
> Finally, consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 8, and wide literals are encoded in a UTF-16 encoding scheme. It has been argued that this configuration would violate [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> due to the presence of 0-valued elements that don't correspond to the null character. However, if this configuration was conforming, then wide_literal() might be expected to return UTF-16BE or UTF-16LE; UTF-16 would be surprising since there are no BOM implications and the endianness is well known and relevant when accessing string elements.
>
> The situation is that, for wide strings:
>
> * Programmers are likely more interested in encoding form than encoding scheme.
> * An encoding scheme may not be relevant (as in the sizeof(wchar_t) == 1, CHAR_BIT == 16 scenario).
>
> The SG16 compromise was to re-purpose IANA's UTF-16 identifier to simultaneously imply a UTF-16 encoding form (for the elements of the string) and, if an encoding scheme is relevant, that the encoding scheme is the native endianness of the wchar_t type. Likewise for UTF-32.
That's sort-of fine, but there are other wide encodings (e.g. wide EBCDIC encodings)
that would likely need similar treatment.
(As a side note, it seems odd that we're keen on using IANA (i.e. a list of
encoding schemes) for wide_literal(), but then we make every effort to read
this as an encoding form.)
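To spell out how a consumer might act on that compromise, here is a rough sketch; it assumes the text_encoding interface approximately as proposed in P1885 (header and member names may still change):

    #include <bit>
    #include <text_encoding>   // header name as proposed in P1885

    void use_wide_literals()
    {
        // Under the SG16 compromise, id::UTF16 names the encoding *form*; if
        // sizeof(wchar_t) != 1, the byte order is the native order of wchar_t.
        if (std::text_encoding::wide_literal().mib() == std::text_encoding::id::UTF16) {
            [[maybe_unused]] constexpr bool big =
                (std::endian::native == std::endian::big);
            // Reinterpret the wide literal's bytes as UTF-16BE or UTF-16LE
            // accordingly (only meaningful when sizeof(wchar_t) == 2).
        }
    }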
> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
>
> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>
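For concreteness, the guarantee in the first bullet above can be spelled out as a byte-level check (my sketch; it assumes wide_literal() indeed identifies UTF-16):

    #include <cassert>
    #include <cstring>

    void check_wide_literal_bytes()
    {
        if constexpr (sizeof(wchar_t) == 2) {    // the case from the first bullet
            const wchar_t* s = L"\u00F0";        // single UTF-16 code unit 0x00F0
            unsigned char bytes[2];
            std::memcpy(bytes, s, 2);            // same bytes reinterpret_cast would expose
            assert((bytes[0] == 0xF0 && bytes[1] == 0x00) ||   // valid UTF-16LE
                   (bytes[0] == 0x00 && bytes[1] == 0xF0));    // valid UTF-16BE
        }
    }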
> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
case.
Jens