
Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 16 Oct 2021 00:02:26 +0200
Folks,
The next revision will return unknown on platforms where CHAR_BIT != 8; that way we avoid being inventive for a use case that no implementation supports or specifies.
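
For illustration, a minimal sketch of what that could look like, assuming the P1885R8 names (the <text_encoding> header, the id::unknown enumerator, and literal() are as proposed and may change in the next revision):

    #include <climits>         // CHAR_BIT
    #include <text_encoding>   // header as proposed by P1885 (assumption)

    // Report "unknown" rather than inventing a mapping when bytes are
    // not 8 bits wide.
    std::text_encoding proposed_literal_encoding() {
    #if CHAR_BIT != 8
        return std::text_encoding(std::text_encoding::id::unknown);
    #else
        return std::text_encoding::literal();
    #endif
    }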

Thanks,


On Fri, Oct 15, 2021, 23:25 Tom Honermann <tom_at_[hidden]> wrote:

> On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
> > On 15/10/2021 21.13, Tom Honermann wrote:
> >> The following is my attempt to describe the concerns we're trying to balance here.
> >>
> >> First, some informal terminology. For the following, consider the text "ð𑄣" consisting of two characters denoted by the Unicode scalar values U+00F0 and U+11123 respectively.
> >>
> >> * Encoding Form: An encoding of a sequence of characters as a sequence of code points. In UTF-16, the above text is encoded as the sequence of 3 16-bit code points { { 0x00F0 }, { 0xD804, 0xDD23 } }.
> > This understanding of "encoding form" does not match the ISO 10646 definition (clause 10).
> >
> > Quote:
> >
> > "This document provides three encoding forms expressing each UCS scalar value in a unique sequence of one or more code units. These are named UTF-8, UTF-16, and UTF-32 respectively."
> >
> > Thus, an encoding form maps a UCS scalar value to a sequence of code units.
> > You are incorrectly stating that an encoding form produces code points as output.
> > (A code point is approximately the same as a UCS scalar value, which is the input (not the output) of the "encoding form" mapping.)
>
> You are right of course; I let myself get too informal; I should have
> just quoted the definitions.
>
> Replace "sequence of characters" with "sequence of UCS scalar values"
> and "code points" with "code units" in my definition above.
>
> >
> >> * Encoding Scheme: An encoding of a sequence of characters as a sequence of endian-dependent code units.
> >>     o In UTF-16BE, the above text is encoded as the sequence of 6 8-bit code units { { 0x00, 0xF0 }, { 0xD8, 0x04, 0xDD, 0x23 } }.
> >>     o In UTF-16LE, the above text is encoded as the sequence of 6 8-bit code units { { 0xF0, 0x00 }, { 0x04, 0xD8, 0x23, 0xDD } }.
> > The use of "code units" is confused here.
> >
> > Quote from ISO 10646 clause 11:
> >
> > "Encoding schemes are octet serializations specific to each UCS encoding
> form, ..."
> >
> > So, the encoding scheme adds an octet serialization on top of an
> encoding form.
> > The output of an encoding scheme is thus a sequence of octets.
>
> Yes. Replace "sequence of characters" with "sequence of code units" and
> "code units" with "bytes" or "octets".
>
> >
> >> * Encoding forms and encoding schemes are related; given encoding X, the encoding scheme of X is an encoding of the sequence of code points of the encoding form of X into a sequence of code units.
> >>
> >> Next, some assertions that I expect to be uncontroversial.
> >>
> >> * Bytes are not octets; they are >= 8 bits.
> >> * The number of bits in a byte is implementation-defined and exposed via the CHAR_BIT macro.
> >> * sizeof(char) is always 1 and therefore always 1 byte.
> >> * sizeof(wchar_t) is >= 1 and therefore 1 or more bytes.
> >> * Both Unicode and IANA restrict encoding schemes to
> > The last bullet appears to be truncated.
>
> Ugh. It's Friday and apparently I've already left for the weekend.
>
> I intended that to state that both Unicode and IANA restrict encoding
> schemes to sequences of 8-bit bytes/octets.
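>
> (A trivial compile-time restatement of those assertions, just to show where each one surfaces in C++:)
>
>     #include <climits>
>     static_assert(CHAR_BIT >= 8, "bytes are >= 8 bits");
>     static_assert(sizeof(char) == 1, "char is always exactly 1 byte");
>     static_assert(sizeof(wchar_t) >= 1, "wchar_t is 1 or more bytes");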
>
> >
> >> Implementations with the following implementation-defined characteristics are common:
> >>
> >> * CHAR_BIT == 8, sizeof(wchar_t) == 2
> >> * CHAR_BIT == 8, sizeof(wchar_t) == 4
> >>
> >> Implementations with the following implementation-defined characteristics are not common, but are known to exist. I don't know what encodings are used for character and string literals for these cases.
> >>
> >> * CHAR_BIT == 16, sizeof(wchar_t) == 1 (Texas Instruments cl54 C++ compiler)
> >> * CHAR_BIT == 16, sizeof(wchar_t) == 2 (CEVA C++ compiler)
> >> * CHAR_BIT == 32, sizeof(wchar_t) == 1 (Analog Devices C++ compiler)
> >>
> >> There are arguably five variants of each of UTF-16, UTF-32, UCS-2, and UCS-4:
> >>
> >> 1. The encoding form that produces a sequence of 16-bit code points (UTF-16, UCS-2) or 32-bit code points (UTF-32, UCS-4).
> >> 2. The big-endian encoding scheme that produces a sequence of 8-bit code units.
> >> 3. The little-endian encoding scheme that produces a sequence of 8-bit code units.
> >> 4. The native-endian encoding scheme in which the endianness is specified as either big or little depending on platform.
> > This one does not exist in ISO 10646.
> Correct. I added it because the SG16 consensus design requires the
> notion of a native-endian encoding scheme.
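>
> (Since C++20, native endianness is directly queryable, so the notion is at least expressible in the language; a trivial sketch:)
>
>     #include <bit>
>     // std::endian::native equals neither value on a mixed-endian
>     // platform, in which case no native-endian scheme applies.
>     constexpr bool has_native_scheme =
>         std::endian::native == std::endian::big ||
>         std::endian::native == std::endian::little;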
> >
> >> 5. The encoding scheme in which the endianness is determined by a leading BOM character (with a default otherwise; usually big-endian).
> >>
> >> IANA does not provide encoding identifiers that enable distinguishing between the five variants listed above.
> > IANA defines encoding schemes, not encoding forms, thus #1 is out-of-scope for IANA. The other three variants defined by Unicode are actually represented for both UTF-16 and UTF-32 (but not for UCS-2 and UCS-4).
> Correct (as reflected in the list mapping them below).
> >
> >> So, we have to decide how to map the IANA encoding identifiers to our intended uses. IANA intends to identify encoding schemes and provides the following identifiers for the encodings mentioned above. The numbers correspond to the numbered variants above.
> >>
> >> * UTF-16BE (#2)
> >> * UTF-16LE (#3)
> >> * UTF-16 (#5)
> >> * UTF-32BE (#2)
> >> * UTF-32LE (#3)
> >> * UTF-32 (#5)
> >> * ISO-10646-UCS-2 (#2, #3, #5; endianness is effectively unspecified)
> >> * ISO-10646-UCS-4 (#2, #3, #5; endianness is effectively unspecified)
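> >>
> >> (For the file/stream-tagging use case, these identifiers are what a text_encoding might be constructed from; a hypothetical sketch against the proposed P1885 interface, names may differ:)
> >>
> >>     std::text_encoding file_enc("UTF-16BE");  // lookup by IANA charset name
> >>     // file_enc.mib() would then identify the registered encoding scheme.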
> >>
> >> Fortunately, we can mostly ignore the UCS-2 and UCS-4 cases as being obsolete.
> > Does that mean we should simply exclude these cases from the set of possible return values for the _literal() functions? If we don't, we should give guidance on what implementations on such platforms should do.
> I think excluding them would be appropriate, but if we don't, I agree
> guidance would be useful.
> >
> >> The text_encoding type provided by P1885 is intended to serve multiple use cases. Examples include discovering how literals are encoded, associating an encoding with a file or a network stream, and communicating an encoding to a conversion facility such as iconv(). In the case of string literals, there is an inherent conflict over whether an encoding form or encoding scheme is desired.
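> >>
> >> For the literal-discovery use case, usage might look like this (a hypothetical sketch against the P1885R8 interface; member names are as proposed at the time and may change):
> >>
> >>     #include <cstdio>
> >>     #include <text_encoding>   // proposed P1885 header (assumption)
> >>
> >>     int main() {
> >>         std::text_encoding enc = std::text_encoding::literal();
> >>         std::printf("ordinary literals: %s\n", enc.name());   // e.g. "UTF-8"
> >>         std::text_encoding wenc = std::text_encoding::wide_literal();
> >>         std::printf("wide literals: %s\n", wenc.name());      // e.g. "UTF-16"; see below
> >>     }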
> >>
> >> Consider an implementation where sizeof(wchar_t) == 2 and wide literals are encoded in a UTF-16 encoding scheme. The elements of a wide literal string are 16-bit code points encoded in either big-endian or little-endian order across 2 bytes. It would therefore make sense for wide_literal() to return either UTF-16BE or UTF-16LE. However, programmers usually interact with wide strings at the encoding form level, so they may expect UTF-16 with an interpretation matching variant #1 or #4 above.
> >>
> >> Now consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 16, and wide literals are encoded in the UTF-16 encoding form. In this case, none of the encoding schemes apply. Programmers are likely to expect wide_literal() to return UTF-16 with an interpretation matching variant #1 above.
> >>
> >> Finally, consider an implementation where sizeof(wchar_t) == 1, CHAR_BIT == 8, and wide literals are encoded in a UTF-16 encoding scheme. It has been argued that this configuration would violate [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> due to the presence of 0-valued elements that don't correspond to the null character. However, if this configuration were conforming, then wide_literal() might be expected to return UTF-16BE or UTF-16LE; UTF-16 would be surprising since there are no BOM implications and the endianness is well known and relevant when accessing string elements.
> >>
> >> The situation is that, for wide strings:
> >>
> >> * Programmers are likely more interested in encoding form than encoding scheme.
> >> * An encoding scheme may not be relevant (as in the sizeof(wchar_t) == 1, CHAR_BIT == 16 scenario).
> >>
> >> The SG16 compromise was to re-purpose IANA's UTF-16 identifier to simultaneously imply a UTF-16 encoding form (for the elements of the string) and, if an encoding scheme is relevant, that the encoding scheme is the native endianness of the wchar_t type. Likewise for UTF-32.
> > That's sort-of fine, but there are other wide encodings (e.g. wide EBCDIC encodings) that would likely need similar treatment.
>
> Yes, but unless I'm mistaken, IANA does not currently specify any wide
> EBCDIC encodings. We can (and probably should) add some guidance in the
> prose of the paper based on IBM documentation, but I'm otherwise unaware
> of how normative guidance could be provided.
>
> >
> > (As a side note, it seems odd that we're keen on using IANA (i.e. a list of encoding schemes) for wide_literal(), but then we make every effort to read this as an encoding form.)
>
> Indeed, a consequence of trying to balance available specifications,
> utility, and programmer expectations.
>
> >> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
> >>
> >> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
> >> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
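> >>
> >> A sketch of what that guarantee lets a program observe, assuming sizeof(wchar_t) == 2 and CHAR_BIT == 8 (illustration only, not proposed wording):
> >>
> >>     const wchar_t* ws = L"\u00F0";  // one 16-bit element, 0x00F0
> >>     const unsigned char* b = reinterpret_cast<const unsigned char*>(ws);
> >>     // UTF-16BE scheme: b[0] == 0x00, b[1] == 0xF0
> >>     // UTF-16LE scheme: b[0] == 0xF0, b[1] == 0x00
> >>     bool big_endian = (b[0] == 0x00 && b[1] == 0xF0);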
> >>
> >> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
> > I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case, and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16 case.
>
> Is this strictly a wording concern? Or do you find the design intent to
> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
> case, though it doesn't really matter).
>
> Tom.
>
> >
> > Jens
> > _______________________________________________
> > Lib-Ext mailing list
> > Lib-Ext_at_[hidden]
> > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> > Link to this post: http://lists.isocpp.org/lib-ext/2021/10/20951.php

Received on 2021-10-15 17:02:41