Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 15 Oct 2021 17:25:05 -0400
On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
> On 15/10/2021 21.13, Tom Honermann wrote:
>> The following is my attempt to describe the concerns we're trying to balance here.
>>
>> First, some informal terminology. For the following, consider the text "ð𑄣" consisting of two characters denoted by the Unicode scalar values U+00F0 and U+11123 respectively.
>>
>> * Encoding Form: An encoding of a sequence of characters as a sequence of code points. In UTF-16, the above text is encoded as the sequence of 3 16-bit code points { { 0x00F0 }, { 0xD804, 0xDD23 } }.
> This understanding of "encoding form" does not match the ISO 10646 definition (clause 10).
>
> Quote:
>
> "This document provides three encoding forms expressing each UCS scalar value
> in a unique sequence of one or more code units. These are named UTF-8, UTF-16,
> and UTF-32 respectively."
>
> Thus, an encoding form maps a UCS scalar value to a sequence of code units.
> You are incorrectly stating that an encoding form produces code points as
> output.
> (A code point is approximately the same as a UCS scalar value, which is
> the input (not the output) of the "encoding form" mapping.)

You are right, of course. I let myself get too informal and should have
just quoted the definitions.

Replace "sequence of characters" with "sequence of UCS scalar values"
and "code points" with "code units" in my definition above.
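
With those substitutions, the UTF-16 encoding form maps each UCS scalar
value to one or two 16-bit code units. A minimal sketch follows, assuming
a hypothetical helper named encode_utf16 (an illustration only, not part
of P1885 or the standard library). For the example text it yields the
code units { 0x00F0 } and { 0xD804, 0xDD23 }.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from P1885): encodes one UCS scalar value
// into its UTF-16 code unit sequence.
std::vector<std::uint16_t> encode_utf16(char32_t scalar) {
    // Scalar values stop at U+10FFFF and exclude the surrogate range.
    assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));
    if (scalar < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        return { static_cast<std::uint16_t>(scalar) };
    }
    // Supplementary planes: a high and a low surrogate code unit.
    const std::uint32_t v = static_cast<std::uint32_t>(scalar) - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))
    };
}
```

The input is a scalar value and the output is a sequence of code units,
which is the direction of the mapping in the ISO 10646 definition.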

>
>> * Encoding Scheme: An encoding of a sequence of characters as a sequence of endian dependent code units.
>> o In UTF-16BE, the above text is encoded as the sequence of 6 8-bit code units { { 0x00, 0xF0 }, { 0xD8, 0x04, 0xDD, 0x23 } }.
>> o In UTF-16LE, the above text is encoded as the sequence of 6 8-bit code units { { 0xF0, 0x00 }, { 0x04, 0xD8, 0x23, 0xDD } }.
> The use of "code units" is confused here.
>
> Quote from ISO 10646 clause 11:
>
> "Encoding schemes are octet serializations specific to each UCS encoding form, ..."
>
> So, the encoding scheme adds an octet serialization on top of an encoding form.
> The output of an encoding scheme is thus a sequence of octets.

Yes. Replace "sequence of characters" with "sequence of code units" and
"code units" with "bytes" or "octets".

>
>> * Encoding forms and encoding schemes are related; given encoding X, the encoding scheme of X is an encoding of the sequence of code points of the encoding form of X into a sequence of code units.
>>
>> Next, some assertions that I expect to be uncontroversial.
>>
>> * Bytes are not octets; they are >= 8 bits.
>> * The number of bits in a byte is implementation-defined and exposed via the CHAR_BIT macro.
>> * sizeof(char) is always 1 and therefore always 1 byte.
>> * sizeof(wchar_t) is >= 1 and therefore 1 or more bytes.
>> * Both Unicode and IANA restrict encoding schemes to
> The last bullet appears to be truncated.

Ugh. It's Friday and apparently I've already left for the weekend.

I intended that to state that both Unicode and IANA restrict encoding
schemes to sequences of 8-bit bytes/octets.
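
The assertions in that list can be stated as compile-time checks. This is
a sketch only; bytes_are_octets and wchar_storage_bits are hypothetical
helper names, not proposed API.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// These hold on every conforming implementation.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(wchar_t) >= 1, "wchar_t occupies one or more bytes");

// Hypothetical helper: true when a byte is exactly an octet, the case in
// which an octet-based encoding scheme maps one-to-one onto bytes.
constexpr bool bytes_are_octets() { return CHAR_BIT == 8; }

// Hypothetical helper: storage width of one wchar_t element in bits.
std::size_t wchar_storage_bits() {
    const std::size_t bits = sizeof(wchar_t) * CHAR_BIT;
    assert(bits >= 8);  // follows from the static assertions above
    return bits;
}
```

On the common implementations mentioned in this thread,
wchar_storage_bits() would be 16 or 32 with bytes_are_octets() true. On
the CHAR_BIT == 16 and CHAR_BIT == 32 implementations it would be false.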

>
>> Implementations with the following implementation-defined characteristics are common:
>>
>> * CHAR_BIT=8 sizeof(wchar_t) == 2
>> * CHAR_BIT=8 sizeof(wchar_t) == 4
>>
>> Implementations with the following implementation-defined characteristics are not common, but are known to exist. I don't know what encodings are used for character and string literals for these cases.
>>
>> * CHAR_BIT=16 sizeof(wchar_t) == 1 (Texas Instruments cl54 C++ compiler)
>> * CHAR_BIT=16 sizeof(wchar_t) == 2 (CEVA C++ compiler)
>> * CHAR_BIT=32 sizeof(wchar_t) == 1 (Analog Devices C++ compiler)
>>
>> There are arguably five variants of each of UTF-16, UTF-32, UCS-2, and UCS-4:
>>
>> 1. The encoding form that produces a sequence of 16-bit code points (UTF-16, UCS-2) or 32-bit code points (UTF-32, UCS-4).
>> 2. The big-endian encoding scheme that produces a sequence of 8-bit code units.
>> 3. The little-endian encoding scheme that produces a sequence of 8-bit code units.
>> 4. The native-endian encoding scheme in which the endianness is specified as either big or little depending on platform.
> This one does not exist in ISO 10646.
Correct. I added it because the SG16 consensus design requires the
notion of a native-endian encoding scheme.
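
As a rough illustration of variant #4, a native-endian scheme can be
resolved to the big-endian or little-endian scheme by probing the byte
order of an integer at run time. The names native_byte_order and
native_scheme_for are hypothetical. The sketch assumes CHAR_BIT == 8 and
that wchar_t uses the same byte order as std::uint32_t.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical names for illustration; not taken from P1885.
enum class byte_order { big, little, other };

// Inspects the object representation of a known 32-bit value.
byte_order native_byte_order() {
    const std::uint32_t probe = 0x01020304;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    if (bytes[0] == 0x01 && bytes[1] == 0x02 && bytes[2] == 0x03 && bytes[3] == 0x04)
        return byte_order::big;
    if (bytes[0] == 0x04 && bytes[1] == 0x03 && bytes[2] == 0x02 && bytes[3] == 0x01)
        return byte_order::little;
    return byte_order::other;  // mixed-endian layouts
}

// Maps an encoding form name to the matching native-endian scheme name.
// Returns the name unchanged if the byte order is neither big nor little.
std::string native_scheme_for(const std::string& form) {
    assert(form == "UTF-16" || form == "UTF-32");
    switch (native_byte_order()) {
        case byte_order::big:    return form + "BE";
        case byte_order::little: return form + "LE";
        default:                 return form;
    }
}
```

For example, native_scheme_for("UTF-16") would give "UTF-16LE" on a
little-endian platform and "UTF-16BE" on a big-endian one.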
>
>> 5. The encoding scheme in which the endianness is determined by a leading BOM character (with a default otherwise; usually big-endian).
>>
>> IANA does not provide encoding identifiers that enable distinguishing between the five variants listed above.
> IANA defines encoding schemes, not encoding forms, thus #1 is out-of-scope for
> IANA. The other three variants defined by Unicode are actually represented
> for both UTF-16 and UTF-32 (but not for UCS-2 and UCS-4).
Correct (as reflected in the list mapping them below).
>
>> So, we have to decide how to map the IANA encoding identifiers to our intended uses. IANA intends to identify encoding schemes and provides the following identifiers for the encodings mentioned above. The numbers correspond to the numbered variant above.
>>
>> * UTF-16BE (#2)
>> * UTF-16LE (#3)
>> * UTF-16 (#5)
>> * UTF-32BE (#2)
>> * UTF-32LE (#3)
>> * UTF-32 (#5)
>> * ISO-10646-UCS-2 (#2, #3, #5; endianness is effectively unspecified)
>> * ISO-10646-UCS-4 (#2, #3, #5; endianness is effectively unspecified)
>>
>> Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as being obsolete.
> Does that mean we should simply exclude these cases from the set of possible
> return values for the _literal() functions? If we don't, we should give
> guidance what implementations on such platforms should do.
I think excluding them would be appropriate, but if we don't, I agree
guidance would be useful.
>
>> The text_encoding type provided by P1885 is intended to serve multiple use cases. Examples include discovering how literals are encoded, associating an encoding with a file or a network stream, and communicating an encoding to a conversion facility such as iconv(). In the case of string literals, there is an inherent conflict with whether an encoding form or encoding scheme is desired.
>>
>> Consider an implementation where sizeof(wchar_t) == 2 and wide literals are encoded in a UTF-16 encoding scheme. The elements of a wide literal string are 16-bit code points encoded in either big-endian or little-endian order across 2 bytes. It would therefore make sense for wide_literal() to return either UTF-16BE or UTF-16LE. However, programmers usually interact with wide strings at the encoding form level, so they may expect UTF-16 with an interpretation matching variant #1 or #4 above.
>>
>> Now consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 16, and wide literals are encoded in the UTF-16 encoding form. In this case, none of the encoding schemes apply. Programmers are likely to expect wide_literal() to return UTF-16 with an interpretation matching variant #1 above.
>>
>> Finally, consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 8, and wide literals are encoded in a UTF-16 encoding scheme. It has been argued that this configuration would violate [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> due to the presence of 0-valued elements that don't correspond to the null character. However, if this configuration were conforming, then wide_literal() might be expected to return UTF-16BE or UTF-16LE; UTF-16 would be surprising since there are no BOM implications and the endianness is well known and relevant when accessing string elements.
>>
>> The situation is that, for wide strings:
>>
>> * Programmers are likely more interested in encoding form than encoding scheme.
>> * An encoding scheme may not be relevant (as in the sizeof(wchar_t) == 1, CHAR_BIT == 16 scenario).
>>
>> The SG16 compromise was to re-purpose IANA's UTF-16 identifier to simultaneously imply a UTF-16 encoding form (for the elements of the string) and, if an encoding scheme is relevant, that the encoding scheme is the native endianness of the wchar_t type. Likewise for UTF-32.
> That's sort-of fine, but there are other wide encodings (e.g. wide EBCDIC encodings)
> that would likely need similar treatment.

Yes, but unless I'm mistaken, IANA does not currently specify any wide
EBCDIC encodings. We can (and probably should) add some guidance in the
prose of the paper based on IBM documentation, but I'm otherwise unaware
of how normative guidance could be provided.

>
> (As a side note, it seems odd that we're keen on using IANA (i.e. a list of
> encoding schemes) for wide_literal(), but then we make every effort to read
> this as an encoding form.)

Indeed, a consequence of trying to balance available specifications,
utility, and programmer expectations.

>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
>>
>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>>
>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
> case.

Is this strictly a wording concern? Or do you find the design intent to
be unclear? (I think you may have intended CHAR_BIT == 8 for the second
case, though it doesn't really matter.)
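
The intended guarantee quoted above, with its reinterpret_cast<const
char*>(L"text") examples, could be checked along these lines. This is a
sketch under stated assumptions: bytes_match_utf16_scheme is a
hypothetical helper, CHAR_BIT is assumed to be 8, and the caller supplies
the expected UTF-16 code units. It copies the object representation with
std::memcpy, which observes the same bytes as the reinterpret_cast.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: true if the bytes of the wide string [s, s + n)
// equal the UTF-16BE or UTF-16LE serialization of `units`.
bool bytes_match_utf16_scheme(const wchar_t* s, std::size_t n,
                              const std::vector<std::uint16_t>& units) {
    assert(s != nullptr || n == 0);
    // Copy the object representation of the wide string.
    std::vector<unsigned char> actual(n * sizeof(wchar_t));
    if (!actual.empty()) {
        std::memcpy(actual.data(), s, actual.size());
    }
    // Build both octet serializations of the expected code units.
    std::vector<unsigned char> be, le;
    for (std::uint16_t unit : units) {
        const auto high = static_cast<unsigned char>(unit >> 8);
        const auto low = static_cast<unsigned char>(unit & 0xFF);
        be.push_back(high);
        be.push_back(low);
        le.push_back(low);
        le.push_back(high);
    }
    return actual == be || actual == le;
}
```

Called as bytes_match_utf16_scheme(L"text", 4, {0x74, 0x65, 0x78, 0x74}),
this is expected to succeed where sizeof(wchar_t) == 2 and the wide
literal encoding is UTF-16, and to fail where sizeof(wchar_t) == 4
because each element then occupies 4 bytes instead of 2.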

Tom.

>
> Jens
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/10/20951.php

Received on 2021-10-15 16:25:14