sg16: Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 18 Oct 2021 18:19:12 -0400

I think the change Corentin indicated would be a significant one; one
that I would want to run by SG16 again. If I understand his intent, it
would mean that an implementation that has CHAR_BIT != 8 would have to
return unknown for literal() and environment() as well as the wide
variants. Such a restriction does not strike me as necessary, nor has it
been discussed in SG16 or in LEWG.
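
To make the implication concrete, here is a minimal sketch of what such a
restriction would amount to. The encoding_id enumeration and the
apply_octet_restriction function are hypothetical names invented for this
illustration; they are not part of P1885 or of any implementation.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the identifiers a text encoding facility
// might report; not the P1885 API.
enum class encoding_id { unknown, utf8, utf16, utf32 };

// Assumed reading of the proposed change: whatever encoding the
// implementation would otherwise report is replaced by "unknown"
// unless a byte is exactly 8 bits wide.
constexpr encoding_id apply_octet_restriction(encoding_id native,
                                              int char_bit = CHAR_BIT) {
    return char_bit == 8 ? native : encoding_id::unknown;
}

static_assert(apply_octet_restriction(encoding_id::utf16, 8) ==
                  encoding_id::utf16, "8-bit bytes: value passes through");
static_assert(apply_octet_restriction(encoding_id::utf16, 16) ==
                  encoding_id::unknown, "16-bit bytes: forced to unknown");
```

Under this reading, an implementation with CHAR_BIT == 16 could not report
UTF-16 for wide literals even if they are encoded in the UTF-16 encoding
form.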

I understand the reluctance to spend time on corner cases that likely
have little or no real-world impact on existing implementations. I don't
see existing implementations as being the point of the discussion
though. Rather, these questions probe the theoretical underpinnings of
the design. If we don't have good answers for these questions, that may
imply that we do not understand the problem space as well as we think we do.

I'm going to again put P1885 on the SG16 agenda for this week's meeting
to discuss Corentin's intended change. I know that Corentin is unlikely
to be able to attend this week, but I think discussion will be useful
regardless.

I think LEWG should continue polling as intended. Poll participants
can decide for themselves whether the continued discussion represents
design issues that must be settled prior to LEWG acceptance or whether
they are detail-level concerns that can be addressed in follow-up papers
or as LWG issues prior to C++23 being finalized. Everyone should keep in
mind that, with regard to the literal(), wide_literal(), environment(),
and wide_environment() functions, all values returned are fundamentally
implementation-defined.

Tom.

On 10/18/21 6:53 AM, Peter Brett via Lib-Ext wrote:
>
> Hi Corentin,
>
> I think that this will solve the problem nicely by keeping the P1885
> facilities focussed on solving the use-cases for which they are intended.
>
> Best regards,
>
> Peter
>
> *From:* Lib-Ext <lib-ext-bounces_at_[hidden]> *On Behalf Of*
> Corentin via Lib-Ext
> *Sent:* 15 October 2021 23:02
> *To:* Tom Honermann <tom_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>; Bryce Adelstein Lelbach aka
> wash <brycelelbach_at_[hidden]>; C++ Library Evolution Working Group
> <lib-ext_at_[hidden]>; SG16 <sg16_at_[hidden]>
> *Subject:* Re: [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings
> to Demystify Them directly to electronic polling for C++23
>
> Folks,
>
> The next revision will return unknown on platforms where CHAR_BIT != 8;
> that way we avoid being inventive for a use case that no one supports or
> specifies.
>
> Thanks,
>
> On Fri, Oct 15, 2021, 23:25 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
> > On 15/10/2021 21.13, Tom Honermann wrote:
> >> The following is my attempt to describe the concerns we're
> trying to balance here.
> >>
> >> First, some informal terminology. For the following, consider
> the text "ð𑄣" consisting of two characters denoted by the Unicode
> scalar values U+00F0 and U+11123 respectively.
> >>
> >> * Encoding Form: An encoding of a sequence of characters as
> a sequence of code points. In UTF-16, the above text is encoded as
> the sequence of 3 16-bit code points { { 0x00F0 }, { 0xD804,
> 0xDD23 } }.
> > This understanding of "encoding form" does not match the ISO
> 10646 definition (clause 10).
> >
> > Quote:
> >
> > "This document provides three encoding forms expressing each UCS
> scalar value
> > in a unique sequence of one or more code units. These are named
> UTF-8, UTF-16,
> > and UTF-32 respectively."
> >
> > Thus, an encoding form maps a UCS scalar value to a sequence of
> code units.
> > You are incorrectly stating that an encoding form produces code
> points as
> > output.
> > (A code point is approximately the same as a UCS scalar value,
> which is
> > the input (not the output) of the "encoding form" mapping.)
>
> You are right of course; I let myself get too informal; I should have
> just quoted the definitions.
>
> Replace "sequence of characters" with "sequence of UCS scalar values"
> and "code points" with "code units" in my definition above.
>
> >
> >> * Encoding Scheme: An encoding of a sequence of characters
> as a sequence of endian dependent code units.
> >> o In UTF-16BE, the above text is encoded as the sequence
> of 6 8-bit code units { { 0x00, 0xF0 }, { 0xD8, 0x04, 0xDD, 0x23 } }.
> >> o In UTF-16LE, the above text is encoded as the sequence
> of 6 8-bit code units { { 0xF0, 0x00 }, { 0x04, 0xD8, 0x23, 0xDD } }.
> > The use of "code units" is confused here.
> >
> > Quote from ISO 10646 clause 11:
> >
> > "Encoding schemes are octet serializations specific to each UCS
> encoding form, ..."
> >
> > So, the encoding scheme adds an octet serialization on top of an
> encoding form.
> > The output of an encoding scheme is thus a sequence of octets.
>
> Yes. Replace "sequence of characters" with "sequence of code
> units" and
> "code units" with "bytes" or "octets".
>
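
As an illustration of the corrected definitions, the two layers can be
sketched as follows for the example text (U+00F0, U+11123). The function
names are made up for this sketch, and the input is assumed to contain
only valid UCS scalar values (no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encoding form (UTF-16): UCS scalar values -> 16-bit code units.
// Assumes every input value is a valid UCS scalar value.
std::vector<std::uint16_t>
to_utf16_code_units(const std::vector<char32_t>& scalars) {
    std::vector<std::uint16_t> units;
    for (char32_t c : scalars) {
        if (c < 0x10000) {
            units.push_back(static_cast<std::uint16_t>(c));
        } else {
            // Supplementary plane: split into a surrogate pair.
            char32_t v = c - 0x10000;
            units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return units;
}

// Encoding scheme (UTF-16BE or UTF-16LE): code units -> octets.
// Each element of the result holds one octet value (0..255).
std::vector<unsigned char>
serialize_utf16(const std::vector<std::uint16_t>& units, bool big_endian) {
    std::vector<unsigned char> octets;
    for (std::uint16_t u : units) {
        unsigned char hi = static_cast<unsigned char>((u >> 8) & 0xFF);
        unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        octets.push_back(big_endian ? hi : lo);
        octets.push_back(big_endian ? lo : hi);
    }
    return octets;
}
```

For the example text, to_utf16_code_units yields { 0x00F0, 0xD804, 0xDD23 }.
Serializing those code units with big_endian set yields the octets
{ 0x00, 0xF0, 0xD8, 0x04, 0xDD, 0x23 }, and with it cleared
{ 0xF0, 0x00, 0x04, 0xD8, 0x23, 0xDD }.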
> >
> >> * Encoding forms and encoding schemes are related; given
> encoding X, the encoding scheme of X is an encoding of the
> sequence of code points of the encoding form of X into a sequence
> of code units.
> >>
> >> Next, some assertions that I expect to be uncontroversial.
> >>
> >> * Bytes are not octets; they are >= 8 bits.
> >> * The number of bits in a byte is implementation-defined and
> exposed via the CHAR_BIT macro.
> >> * sizeof(char) is always 1 and therefore always 1 byte.
> >> * sizeof(wchar_t) is >= 1 and therefore 1 or more bytes.
> >> * Both Unicode and IANA restrict encoding schemes to
> > The last bullet appears to be truncated.
>
> Ugh. It's Friday and apparently I've already left for the weekend.
>
> I intended that to state that both Unicode and IANA restrict encoding
> schemes to sequences of 8-bit bytes/octets.
>
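
The assertions in that list that a compiler can check can be written as
static assertions. This is only a sketch; bits_in is a hypothetical
helper introduced for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// A byte is at least 8 bits wide; the exact width is CHAR_BIT.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// sizeof counts bytes, and char occupies exactly one byte by definition.
static_assert(sizeof(char) == 1, "char is always one byte");

// wchar_t occupies one or more bytes.
static_assert(sizeof(wchar_t) >= 1, "wchar_t is one or more bytes");

// Hypothetical helper: number of bits of storage occupied by a T.
template <class T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}
```

On an implementation with CHAR_BIT == 16 and sizeof(wchar_t) == 1,
bits_in<wchar_t>() would be 16 even though wchar_t is a single byte.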
> >
> >> Implementations with the following implementation-defined
> characteristics are common:
> >>
> >> * CHAR_BIT=8 sizeof(wchar_t) == 2
> >> * CHAR_BIT=8 sizeof(wchar_t) == 4
> >>
> >> Implementations with the following implementation-defined
> characteristics are not common, but are known to exist. I don't
> know what encodings are used for character and string literals for
> these cases.
> >>
> >> * CHAR_BIT=16 sizeof(wchar_t) == 1 (Texas Instruments cl54
> C++ compiler)
> >> * CHAR_BIT=16 sizeof(wchar_t) == 2 (CEVA C++ compiler)
> >> * CHAR_BIT=32 sizeof(wchar_t) == 1 (Analog Devices C++ compiler)
> >>
> >> There are arguably five variants of each of UTF-16, UTF-32,
> UCS-2, and UCS-4:
> >>
> >> 1. The encoding form that produces a sequence of 16-bit code
> points (UTF-16, UCS-2) or 32-bit code points (UTF-32, UCS-4).
> >> 2. The big-endian encoding scheme that produces a sequence of
> 8-bit code units.
> >> 3. The little-endian encoding scheme that produces a sequence
> of 8-bit code units.
> >> 4. The native-endian encoding scheme in which the endianness
> is specified as either big or little depending on platform.
> > This one does not exist in ISO 10646.
> Correct. I added it because the SG16 consensus design requires the
> notion of a native-endian encoding scheme.
> >
> >> 5. The encoding scheme in which the endianness is determined
> by a leading BOM character (with a default otherwise; usually
> big-endian).
> >>
> >> IANA does not provide encoding identifiers that enable
> distinguishing between the five variants listed above.
> > IANA defines encoding schemes, not encoding forms, thus #1 is
> out-of-scope for
> > IANA. The other three variants defined by Unicode are actually
> represented
> > for both UTF-16 and UTF-32 (but not for UCS-2 and UCS-4).
> Correct (as reflected in the list mapping them below).
> >
> >> So, we have to decide how to map the IANA encoding
> identifiers to our intended uses. IANA intends to identify
> encoding schemes and provides the following identifiers for the
> encodings mentioned above. The numbers correspond to the numbered
> variant above.
> >>
> >> * UTF-16BE (#2)
> >> * UTF-16LE (#3)
> >> * UTF-16 (#5)
> >> * UTF-32BE (#2)
> >> * UTF-32LE (#3)
> >> * UTF-32 (#5)
> >> * ISO-10646-UCS-2 (#2, #3, #5; endianness is effectively
> unspecified)
> >> * ISO-10646-UCS-4 (#2, #3, #5; endianness is effectively
> unspecified)
> >>
> >> Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as
> being obsolete.
> > Does that mean we should simply exclude these cases from the set
> of possible
> > return values for the _literal() functions? If we don't, we
> should give
> > guidance what implementations on such platforms should do.
> I think excluding them would be appropriate, but if we don't, I agree
> guidance would be useful.
> >
> >> The text_encoding type provided by P1885 is intended to serve
> multiple use cases. Examples include discovering how literals are
> encoded, associating an encoding with a file or a network stream,
> and communicating an encoding to a conversion facility such as
> iconv(). In the case of string literals, there is an inherent
> conflict with whether an encoding form or encoding scheme is desired.
> >>
> >> Consider an implementation where sizeof(wchar_t) == 2 and wide
> literals are encoded in a UTF-16 encoding scheme. The elements of
> a wide literal string are 16-bit code points encoded in either
> big-endian or little-endian order across 2 bytes. It would
> therefore make sense for wide_literal() to return either UTF-16BE
> or UTF-16LE. However, programmers usually interact with wide
> strings at the encoding form level, so they may expect UTF-16 with
> an interpretation matching variant #1 or #4 above.
> >>
> >> Now consider an implementation where sizeof(wchar_t) == 1,
> CHAR_BIT == 16, and wide literals are encoded in the UTF-16
> encoding form. In this case, none of the encoding schemes apply.
> Programmers are likely to expect wide_literal() to return UTF-16
> with an interpretation matching variant #1 above.
> >>
> >> Finally, consider an implementation where sizeof(wchar_t) == 1,
> CHAR_BIT == 8, and wide literals are encoded in a UTF-16 encoding
> scheme. It has been argued that this configuration would violate
> [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3>
> due to the presence of 0-valued elements that don't correspond to
> the null character. However, if this configuration was conforming,
> then wide_literal() might be expected to return UTF-16BE or
> UTF-16LE; UTF-16 would be surprising since there are no BOM
> implications and the endianness is well known and relevant when
> accessing string elements.
> >>
> >> The situation is that, for wide strings:
> >>
> >> * Programmers are likely more interested in encoding form
> than encoding scheme.
> >> * An encoding scheme may not be relevant (as in the
> sizeof(wchar_t) == 1, CHAR_BIT == 16 scenario).
> >>
> >> The SG16 compromise was to re-purpose IANA's UTF-16 identifier
> to simultaneously imply a UTF-16 encoding form (for the elements
> of the string) and, if an encoding scheme is relevant, that the
> encoding scheme is the native endianness of the wchar_t type.
> Likewise for UTF-32.
> > That's sort-of fine, but there are other wide encodings (e.g.
> wide EBCDIC encodings)
> > that would likely need similar treatment.
>
> Yes, but unless I'm mistaken, IANA does not currently specify any
> wide
> EBCDIC encodings. We can (and probably should) add some guidance
> in the
> prose of the paper based on IBM documentation, but I'm otherwise
> unaware
> of how normative guidance could be provided.
>
> >
> > (As a side note, it seems odd that we're keen on using IANA
> (i.e. a list of
> > encoding schemes) for wide_literal(), but then we make every
> effort to read
> > this as an encoding form.)
>
> Indeed, a consequence of trying to balance available specifications,
> utility, and programmer expectations.
>
> >> One of the intended guarantees is that, when sizeof(wchar_t) !=
> 1, that the underlying byte representation of a wide string
> literal match an encoding scheme associated with the encoding form
> as indicated by wide_literal(). For example:
> >>
> >> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok;
> reinterpret_cast<const char*>(L"text") yields a sequence of bytes
> that constitutes valid UTF-16BE or UTF-16LE.
> >> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16.
> Invalid; reinterpret_cast<const char*>(L"text") yields a sequence
> of bytes that is not valid UTF-16BE or UTF-16LE (due to each code
> point being stored across 4 bytes instead of 2).
> >>
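
The intended guarantee can be probed with a small test. This is a sketch
under assumptions: object_bytes, serialize_units, and bytes_match_scheme
are hypothetical helpers, not part of P1885, and the check only compares
a string's object representation against big-endian and little-endian
octet serializations of a given width.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy out the object representation (the bytes) of n code units.
template <class CharT>
std::vector<unsigned char> object_bytes(const CharT* s, std::size_t n) {
    std::vector<unsigned char> bytes(n * sizeof(CharT));
    if (!bytes.empty()) std::memcpy(bytes.data(), s, bytes.size());
    return bytes;
}

// Serialize each code unit as octets_per_unit octets in the given order.
// Intended for widths of 2 or 4 octets.
template <class CharT>
std::vector<unsigned char> serialize_units(const CharT* s, std::size_t n,
                                           std::size_t octets_per_unit,
                                           bool big_endian) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned long v = static_cast<unsigned long>(s[i]);
        for (std::size_t k = 0; k < octets_per_unit; ++k) {
            std::size_t shift =
                8 * (big_endian ? octets_per_unit - 1 - k : k);
            out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
        }
    }
    return out;
}

// True if the bytes of the string form a valid BE or LE serialization
// with the given number of octets per code unit.
template <class CharT>
bool bytes_match_scheme(const CharT* s, std::size_t n,
                        std::size_t octets_per_unit) {
    if (CHAR_BIT != 8) return false;  // encoding schemes are octet based
    std::vector<unsigned char> bytes = object_bytes(s, n);
    return bytes == serialize_units(s, n, octets_per_unit, true) ||
           bytes == serialize_units(s, n, octets_per_unit, false);
}
```

With these helpers, bytes_match_scheme(L"text", 4, 2) would be expected to
hold where sizeof(wchar_t) == 2 and CHAR_BIT == 8, and to fail where
sizeof(wchar_t) == 4, which corresponds to the two cases listed above.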
> >> It may be that the paper would benefit from some updates to
> make this more clear, but I don't have any specific suggestions at
> this time.
> > I think the wording currently has no guidance for the
> sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
> > and whether it is supposed to be treated differently from the
> sizeof(wchar_t) == 2, CHAR_BIT == 16
> > case.
>
> Is this strictly a wording concern? Or do you find the design
> intent to
> be unclear? (I think you may have intended CHAR_BIT == 8 for the
> second
> case, though it doesn't really matter).
>
> Tom.
>
> >
> > Jens
> > _______________________________________________
> > Lib-Ext mailing list
> > Lib-Ext_at_[hidden]
> > Subscription:
> https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> > Link to this post:
> http://lists.isocpp.org/lib-ext/2021/10/20951.php
>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/10/20998.php

Received on 2021-10-18 17:19:17