On 10/15/21 9:42 AM, Jens Maurer via Lib-Ext wrote:
On 15/10/2021 15.20, Corentin wrote:

On Fri, Oct 15, 2021 at 1:14 PM Ville Voutilainen <ville.voutilainen@gmail.com <mailto:ville.voutilainen@gmail.com>> wrote:

    On Fri, 15 Oct 2021 at 13:55, Jens Maurer via Lib-Ext
    <lib-ext@lists.isocpp.org <mailto:lib-ext@lists.isocpp.org>> wrote:
    >
    > On 15/10/2021 12.41, Bryce Adelstein Lelbach aka wash wrote:
    > > Jens, this does not sound like a library design matter.
    > >
    > > Can we please stop holding this paper up in LEWG unless there are library design questions?
    > > If there are questions about the specifics of wording or text/Unicode details, there are groups that can deal with that (LWG and SG16).
    > > Just because LEWG says we approve this paper does not mean it automatically goes into the standard, it just means we are happy with the library design.
    >
    > I am raising concerns I have about the current state of the paper.
    >
    > If the chair of LEWG deems those concerns not to be relevant at the
    > level of LEWG, I'm fine with that, and I'll raise them again in LWG
    > and/or plenary, as need be.

    The following bit has a design question in it:

    > >     The paper is missing a normative definition of "encoding scheme"
    > >     with particular attention to the fact that an octet is not a
    > >     C++ byte.  From such a definition, I would hope to gain clarity
    > >     how UTF-16 should be handled on a platform with CHAR_BIT == 16.


I was not expecting this to be up for polling this week, and I have limited time,
but the intent of SG16 was made clear. I asked SG16 last time whether they had further concerns with the design, and they did not.
The intent/model chosen is "the string can be reinterpreted as char*, fed to iconv, and iconv will do something sensible."
Unfortunately that does not help with the CHAR_BIT == 16 case, as we have no existing practice with text libraries on systems where CHAR_BIT != 8.
The definition of encoding scheme used is independent of CHAR_BIT, as long as the bit pattern of the object representation is consistent with the specification of an encoding.

Jens, did you raise that question at the last SG16 meeting?
The last SG16 meeting did not leave lots of room for discussion after
the two presentations. I focused on getting the "object representation"
model understood. (Maybe I'm stupid, but given the absence of any mention
of "object representation" in the previous incarnations of the paper, that
model was news to me.)

The CHAR_BIT == 16 question was brought up by Tom Honermann in
"[SG16] Agenda for the 2021-10-06 SG16 telecon" on 2021-10-01.
In particular, the question is whether a platform where sizeof(wchar_t) == 1
and CHAR_BIT == 16 should return UTF-16 from wide_literal()
(if that is the encoding it uses).
Similarly, if sizeof(wchar_t) == 2 on such a CHAR_BIT == 16 platform,
is UTF-16 the expected return value if the UTF-16 code units are dispersed
across two chars?  (First 8 bits in the first 16-bit char, second 8 bits
in the second 16-bit char.)

I don't think I stayed to the end of the last SG16 meeting (late hour here);
sorry if those questions were discussed. Please point to the wording that
addresses these questions.

The email thread that Jens alluded to is available here.

I'll put discussion and clarification of this on the agenda for the SG16 telecon on Wednesday of next week (2021-10-20). I don't think Corentin will be able to attend, but I think these concerns are well enough understood that we can discuss and verify consensus. At this time, I still consider these detail-level questions that do not substantially impact the design.

The following is my attempt to describe the concerns we're trying to balance here.

First, some informal terminology. For the following, consider the text "ð𑄣" consisting of two characters denoted by the Unicode scalar values U+00F0 and U+11123 respectively.

Next, some assertions that I expect to be uncontroversial.

Implementations with the following implementation-defined characteristics are common:

  - CHAR_BIT == 8 and sizeof(wchar_t) == 2, with wide literals encoded in a UTF-16 encoding scheme (e.g., Windows).
  - CHAR_BIT == 8 and sizeof(wchar_t) == 4, with wide literals encoded in a UTF-32 encoding scheme (e.g., Linux).

Implementations with the following implementation-defined characteristics are not common, but are known to exist; I don't know what encodings are used for character and string literals in these cases:

  - CHAR_BIT == 16 and sizeof(wchar_t) == 1.
  - CHAR_BIT == 16 and sizeof(wchar_t) == 2.

There are arguably five variants of each of UTF-16, UTF-32, UCS-2, and UCS-4:

  1. The encoding form that produces a sequence of 16-bit code units (UTF-16, UCS-2) or 32-bit code units (UTF-32, UCS-4).
  2. The big-endian encoding scheme that produces a sequence of 8-bit code units.
  3. The little-endian encoding scheme that produces a sequence of 8-bit code units.
  4. The native-endian encoding scheme in which the endianness is specified as either big or little depending on platform.
  5. The encoding scheme in which the endianness is determined by a leading BOM character (with a default otherwise; usually big-endian).

IANA does not provide encoding identifiers that enable distinguishing between the five variants listed above, so we have to decide how to map the IANA encoding identifiers to our intended uses. IANA intends to identify encoding schemes and provides the following identifiers for the encodings mentioned above; the numbers correspond to the numbered variants above:

  - UTF-16BE, UTF-32BE (variant 2)
  - UTF-16LE, UTF-32LE (variant 3)
  - UTF-16, UTF-32 (variant 5; per RFC 2781, UTF-16 without a BOM defaults to big-endian)

Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as being obsolete.

The text_encoding type provided by P1885 is intended to serve multiple use cases. Examples include discovering how literals are encoded, associating an encoding with a file or a network stream, and communicating an encoding to a conversion facility such as iconv(). In the case of string literals, there is an inherent tension over whether an encoding form or an encoding scheme is desired.

Consider an implementation where sizeof(wchar_t) == 2 and wide literals are encoded in a UTF-16 encoding scheme. The elements of a wide literal string are 16-bit code units stored in either big-endian or little-endian byte order across 2 bytes. It would therefore make sense for wide_literal() to return either UTF-16BE or UTF-16LE. However, programmers usually interact with wide strings at the encoding form level, so they may expect UTF-16 with an interpretation matching variant #1 or #4 above.

Now consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 16, and wide literals are encoded in the UTF-16 encoding form. In this case, none of the encoding schemes apply. Programmers are likely to expect wide_literal() to return UTF-16 with an interpretation matching variant #1 above.

Finally, consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 8, and wide literals are encoded in a UTF-16 encoding scheme. It has been argued that this configuration would violate [lex.charset]p3 due to the presence of 0-valued elements that don't correspond to the null character. However, if this configuration were conforming, then wide_literal() might be expected to return UTF-16BE or UTF-16LE; UTF-16 would be surprising, since there are no BOM implications and the endianness is well known and relevant when accessing string elements.

The situation, then, is that for wide strings both the encoding form and an encoding scheme may be relevant, and IANA's identifiers do not let us distinguish which is meant.

The SG16 compromise was to re-purpose IANA's UTF-16 identifier to simultaneously imply a UTF-16 encoding form (for the elements of the string) and, if an encoding scheme is relevant, that the encoding scheme is the native endianness of the wchar_t type. Likewise for UTF-32.

One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:

It may be that the paper would benefit from some updates to make this clearer, but I don't have any specific suggestions at this time.

Please offer corrections for anything I got wrong or failed to address above.

Tom.

Jens
_______________________________________________
Lib-Ext mailing list
Lib-Ext@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
Link to this post: http://lists.isocpp.org/lib-ext/2021/10/20929.php