sg16: Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 15 Oct 2021 15:13:22 -0400

On 10/15/21 9:42 AM, Jens Maurer via Lib-Ext wrote:
> On 15/10/2021 15.20, Corentin wrote:
>>
>> On Fri, Oct 15, 2021 at 1:14 PM Ville Voutilainen <ville.voutilainen_at_[hidden]> wrote:
>>
>> On Fri, 15 Oct 2021 at 13:55, Jens Maurer via Lib-Ext
>> <lib-ext_at_[hidden]> wrote:
>> >
>> > On 15/10/2021 12.41, Bryce Adelstein Lelbach aka wash wrote:
>> > > Jens, this does not sound like a library design matter.
>> > >
>> > > Can we please stop holding this paper up in LEWG unless there are library design questions?
>> > > If there are questions about the specifics of wording or text/Unicode details, there are groups that can deal with that (LWG and SG16).
>> > > Just because LEWG says we approve this paper does not mean it automatically goes into the standard, it just means we are happy with the library design.
>> >
>> > I am raising concerns I have about the current state of the paper.
>> >
>> > If the chair of LEWG deems those concerns not to be relevant at the
>> > level of LEWG, I'm fine with that, and I'll raise them again in LWG
>> > and/or plenary, as need be.
>>
>> The following bit has a design question in it:
>>
>> > > The paper is missing a normative definition of "encoding scheme"
>> > > with particular attention to the fact that an octet is not a
>> > > C++ byte. From such a definition, I would hope to gain clarity
>> > > how UTF-16 should be handled on a platform with CHAR_BITS == 16.
>>
>>
>> I was not expecting this to be up for polling this week, and I have limited time,
>> but the intent of SG16 was made clear. I asked SG16 last time if they had further concerns with the design and they did not.
>> The intent/model chosen is "can be reinterpreted to char*, fed to iconv and iconv will do something sensible".
>> Unfortunately that does not help with the char_bits=16 case, as we do not have existing practices with text library on systems with char_bits !=8.
>> The definition of encoding scheme used is independent of char_bits, as long as the bit pattern of the object representation is consistent with the specification of an encoding.
>>
>> Jens, did you raise that question at the last SG16 meeting?
> The last SG16 meeting did not leave lots of room for discussion after
> the two presentations. I focused on getting the "object representation"
> model understood. (Maybe I'm stupid, but given the absence of any mention
> of "object representation" in the previous incarnations of the paper, that
> model was news to me.)
>
> The CHAR_BIT == 16 question was brought up by Tom Honermann in
> "[SG16] Agenda for the 2021-10-06 SG16 telecon" on 2021-10-01.
> In particular, the question is whether a platform where sizeof(wchar_t) == 1
> and CHAR_BIT == 16 should return UTF16 for wide_literal()
> (if that's the encoding they use).
> Similarly, if sizeof(wchar_t) == 2 on such a CHAR_BIT == 16 platform,
> is UTF16 the expected return value if the UTF16 code units are dispersed
> across two chars? (First 8 bits in the first 16-bit char, second 8 bits
> in the second 16-bit char.)
>
> I don't think I stayed to the end of the last SG16 meeting (late hour here);
> sorry if those questions were discussed. Please point to the wording that
> addresses these questions.

The email thread that Jens alluded to is available here
<https://lists.isocpp.org/sg16/2021/10/2676.php>.

I'll put discussion and clarification of this on the agenda for the SG16
telecon Wednesday of next week (2021-10-20). I don't think Corentin will
be able to attend, but I think these concerns are well enough understood
that we can discuss and verify consensus. At this time, I still consider
these to be detail-level questions that do not substantially impact design.

The following is my attempt to describe the concerns we're trying to
balance here.

First, some informal terminology. For the following, consider the text
"ð𑄣" consisting of two characters denoted by the Unicode scalar values
U+00F0 and U+11123 respectively.

  * Encoding Form: An encoding of a sequence of characters as a sequence
    of code points. In UTF-16, the above text is encoded as the sequence
    of 3 16-bit code points { { 0x00F0 }, { 0xD804, 0xDD23 } }.
  * Encoding Scheme: An encoding of a sequence of characters as a
    sequence of endian dependent code units.
      o In UTF-16BE, the above text is encoded as the sequence of 6
        8-bit code units { { 0x00, 0xF0 }, { 0xD8, 0x04, 0xDD, 0x23 } }.
      o In UTF-16LE, the above text is encoded as the sequence of 6
        8-bit code units { { 0xF0, 0x00 }, { 0x04, 0xD8, 0x23, 0xDD } }.
  * Encoding forms and encoding schemes are related; given encoding X,
    the encoding scheme of X is an encoding of the sequence of code
    points of the encoding form of X into a sequence of code units.
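
As a minimal sketch of the two layers described above (the function
names are hypothetical, not from P1885), the following C++ first applies
the UTF-16 encoding form to the scalar values U+00F0 and U+11123 and
then serializes the result with a big-endian or little-endian encoding
scheme. Octets are stored in unsigned char, which is assumed only to be
at least 8 bits wide.

```cpp
#include <cassert>
#include <vector>

// Encoding form: Unicode scalar values -> 16-bit UTF-16 values.
std::vector<char16_t> to_utf16_form(const std::vector<char32_t>& scalars) {
    std::vector<char16_t> out;
    for (char32_t c : scalars) {
        // Precondition: c is a Unicode scalar value (no surrogates).
        assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
        if (c < 0x10000) {
            out.push_back(static_cast<char16_t>(c));
        } else {
            const char32_t v = c - 0x10000;  // 20 significant bits
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// Encoding scheme: 16-bit values -> octets in a fixed byte order.
// Each octet lives in an unsigned char, which may be wider than 8 bits.
std::vector<unsigned char> to_utf16_scheme(const std::vector<char16_t>& form,
                                           bool big_endian) {
    std::vector<unsigned char> out;
    for (char16_t u : form) {
        const unsigned char hi = static_cast<unsigned char>((u >> 8) & 0xFF);
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        out.push_back(big_endian ? hi : lo);
        out.push_back(big_endian ? lo : hi);
    }
    return out;
}
```

Calling to_utf16_form({0x00F0, 0x11123}) and passing the result to
to_utf16_scheme() with big_endian set to true or false reproduces the
three sequences listed above.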

Next, some assertions that I expect to be uncontroversial.

  * Bytes are not octets; they are >= 8 bits.
  * The number of bits in a byte is implementation-defined and exposed
    via the CHAR_BIT macro.
  * sizeof(char) is always 1 and therefore always 1 byte.
  * sizeof(wchar_t) is >= 1 and therefore 1 or more bytes.
  * Both Unicode and IANA restrict encoding schemes to sequences of
    8-bit code units (octets).

Implementations with the following implementation-defined
characteristics are common:

  * CHAR_BIT == 8, sizeof(wchar_t) == 2
  * CHAR_BIT == 8, sizeof(wchar_t) == 4

Implementations with the following implementation-defined
characteristics are not common, but are known to exist. I don't know
what encodings are used for character and string literals for these cases.

  * CHAR_BIT == 16, sizeof(wchar_t) == 1 (Texas Instruments cl54 C++ compiler)
  * CHAR_BIT == 16, sizeof(wchar_t) == 2 (CEVA C++ compiler)
  * CHAR_BIT == 32, sizeof(wchar_t) == 1 (Analog Devices C++ compiler)
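
The assertions about bytes, char, and wchar_t above can be checked at
compile time. The following is a small sketch; the helper names are
hypothetical and only illustrate how the listed configurations differ
in how many bytes a 16-bit value occupies.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "char is exactly one byte");
static_assert(sizeof(wchar_t) >= 1, "wchar_t occupies one or more bytes");

// Bits of storage in one wchar_t element (hypothetical helper).
inline std::size_t wchar_storage_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Bytes needed to store one 16-bit value on this platform:
// 2 when CHAR_BIT is 8, and 1 when CHAR_BIT is 16 or 32.
inline std::size_t bytes_per_16_bit_value() {
    const std::size_t n = (16 + CHAR_BIT - 1) / CHAR_BIT;
    assert(n == 1 || n == 2);  // follows from CHAR_BIT >= 8
    return n;
}
```

When bytes_per_16_bit_value() is 1, there is no byte order to speak of
for a 16-bit element, which is why the CHAR_BIT == 16 and CHAR_BIT == 32
configurations need separate consideration.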

There are arguably five variants of each of UTF-16, UTF-32, UCS-2, and
UCS-4:

1. The encoding form that produces a sequence of 16-bit code points
    (UTF-16, UCS-2) or 32-bit code points (UTF-32, UCS-4).
2. The big-endian encoding scheme that produces a sequence of 8-bit
    code units.
3. The little-endian encoding scheme that produces a sequence of 8-bit
    code units.
4. The native-endian encoding scheme in which the endianness is
    specified as either big or little depending on platform.
5. The encoding scheme in which the endianness is determined by a
    leading BOM character (with a default otherwise; usually big-endian).
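
To illustrate variant #5 for UTF-16, here is a sketch of a decoder that
inspects a leading BOM and otherwise falls back to a default byte order.
The type and function names are hypothetical, and the big-endian default
is the "usually big-endian" assumption from the list above rather than a
rule that every consumer follows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ByteOrder { Big, Little };

struct BomResult {
    ByteOrder order;             // byte order to use for the payload
    std::size_t payload_offset;  // 2 if a BOM was consumed, else 0
};

// Variant #5: a leading BOM (U+FEFF) selects the byte order.
BomResult detect_utf16_byte_order(const std::vector<unsigned char>& octets,
                                  ByteOrder fallback = ByteOrder::Big) {
    if (octets.size() >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) return {ByteOrder::Big, 2};
        if (octets[0] == 0xFF && octets[1] == 0xFE) return {ByteOrder::Little, 2};
    }
    return {fallback, 0};
}

// Reassemble 16-bit values from octets using the detected byte order.
std::vector<char16_t> decode_utf16_octets(const std::vector<unsigned char>& octets) {
    const BomResult bom = detect_utf16_byte_order(octets);
    assert((octets.size() - bom.payload_offset) % 2 == 0);  // whole units only
    std::vector<char16_t> out;
    for (std::size_t i = bom.payload_offset; i + 1 < octets.size(); i += 2) {
        const unsigned first = octets[i];
        const unsigned second = octets[i + 1];
        out.push_back(static_cast<char16_t>(
            bom.order == ByteOrder::Big ? (first << 8) | second
                                        : (second << 8) | first));
    }
    return out;
}
```

Variants #2 and #3 correspond to skipping the detection step and fixing
the byte order up front; variant #4 fixes it to whatever the platform
uses.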

IANA does not provide encoding identifiers that enable distinguishing
between the five variants listed above. So, we have to decide how to map
the IANA encoding identifiers to our intended uses. IANA intends to
identify encoding schemes and provides the following identifiers for the
encodings mentioned above. The numbers in parentheses refer to the
numbered variants above.

  * UTF-16BE (#2)
  * UTF-16LE (#3)
  * UTF-16 (#5)
  * UTF-32BE (#2)
  * UTF-32LE (#3)
  * UTF-32 (#5)
  * ISO-10646-UCS-2 (#2, #3, #5; endianness is effectively unspecified)
  * ISO-10646-UCS-4 (#2, #3, #5; endianness is effectively unspecified)

Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as being
obsolete.

The text_encoding type provided by P1885 is intended to serve multiple
use cases. Examples include discovering how literals are encoded,
associating an encoding with a file or a network stream, and
communicating an encoding to a conversion facility such as iconv(). In
the case of string literals, there is an inherent conflict over whether
an encoding form or an encoding scheme is desired.

Consider an implementation where sizeof(wchar_t) == 2 and wide literals
are encoded in a UTF-16 encoding scheme. The elements of a wide literal
string are 16-bit code points encoded in either big-endian or
little-endian order across 2 bytes. It would therefore make sense for
wide_literal() to return either UTF-16BE or UTF-16LE. However,
programmers usually interact with wide strings at the encoding form
level, so they may expect UTF-16 with an interpretation matching variant
#1 or #4 above.

Now consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT ==
16, and wide literals are encoded in the UTF-16 encoding form. In this
case, none of the encoding schemes apply. Programmers are likely to
expect wide_literal() to return UTF-16 with an interpretation matching
variant #1 above.

Finally, consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT
== 8, and wide literals are encoded in a UTF-16 encoding scheme. It has
been argued that this configuration would violate [lex.charset]p3
<http://eel.is/c++draft/lex.charset#3> due to the presence of 0-valued
elements that don't correspond to the null character. However, if this
configuration were conforming, then wide_literal() might be expected to
return UTF-16BE or UTF-16LE; UTF-16 would be surprising since there are
no BOM implications and the endianness is well known and relevant when
accessing string elements.

The situation is that, for wide strings:

  * Programmers are likely more interested in encoding form than
    encoding scheme.
  * An encoding scheme may not be relevant (as in the sizeof(wchar_t) ==
    1, CHAR_BIT == 16 scenario).

The SG16 compromise was to re-purpose IANA's UTF-16 identifier to
simultaneously imply a UTF-16 encoding form (for the elements of the
string) and, if an encoding scheme is relevant, that the encoding scheme
is the native endianness of the wchar_t type. Likewise for UTF-32.
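
One way to picture this compromise is the sketch below. It does not call
the proposed wide_literal(); instead, a hypothetical implied_scheme()
helper takes the identifier that wide_literal() would report and spells
out the encoding scheme a reader of the underlying bytes should assume.
It assumes wchar_t has no padding bits and is stored in plain big-endian
or little-endian order.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// True if the lowest-addressed byte of a wchar_t holds its low-order bits.
inline bool wchar_low_byte_first() {
    const wchar_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Given the identifier reported for wide literals ("UTF-16" or "UTF-32"),
// name the scheme implied for the underlying bytes. When wchar_t is a
// single byte, no scheme is relevant and only the encoding form applies.
inline std::string implied_scheme(const std::string& form) {
    assert(form == "UTF-16" || form == "UTF-32");
    if (sizeof(wchar_t) == 1) return form;
    return form + (wchar_low_byte_first() ? "LE" : "BE");
}
```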

One of the intended guarantees is that, when sizeof(wchar_t) != 1, the
underlying byte representation of a wide string literal matches an
encoding scheme associated with the encoding form indicated by
wide_literal(). For example:

  * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok;
    reinterpret_cast<const char*>(L"text") yields a sequence of bytes
    that constitutes valid UTF-16BE or UTF-16LE.
  * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid;
    reinterpret_cast<const char*>(L"text") yields a sequence of bytes
    that is not valid UTF-16BE or UTF-16LE (due to each code point being
    stored across 4 bytes instead of 2).
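
The two cases above can be sketched as a check on the object
representation. The function name is hypothetical, and the check assumes
the elements already hold UTF-16 encoding form values; it only asks
whether the bytes behind them read as UTF-16BE or UTF-16LE.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Returns true if the bytes underlying `units` are a valid UTF-16BE or
// UTF-16LE serialization of the element values.
template <typename CharT>
bool bytes_form_utf16_scheme(const CharT* units, std::size_t count) {
    assert(units != nullptr);
    // Encoding schemes are defined over octets, two per 16-bit value.
    if (CHAR_BIT != 8 || sizeof(CharT) != 2) return false;
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(units);
    bool big = true;
    bool little = true;
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned long value = static_cast<unsigned long>(units[i]) & 0xFFFFul;
        const unsigned char hi = static_cast<unsigned char>(value >> 8);
        const unsigned char lo = static_cast<unsigned char>(value & 0xFF);
        big = big && bytes[2 * i] == hi && bytes[2 * i + 1] == lo;
        little = little && bytes[2 * i] == lo && bytes[2 * i + 1] == hi;
    }
    return big || little;
}
```

On an implementation with CHAR_BIT == 8, bytes_form_utf16_scheme(L"text", 4)
is true when sizeof(wchar_t) == 2 and false when sizeof(wchar_t) == 4,
matching the two bullets above.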

It may be that the paper would benefit from some updates to make this
more clear, but I don't have any specific suggestions at this time.

Please offer corrections for anything I got wrong or failed to address
above.

Tom.

> Jens

Received on 2021-10-15 14:13:25