Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 19 Oct 2021 14:17:57 -0400
On 10/15/21 6:00 PM, Jens Maurer via Lib-Ext wrote:
> On 15/10/2021 23.25, Tom Honermann wrote:
>> On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
>> Yes, but unless I'm mistaken, IANA does not currently specify any wide
>> EBCDIC encodings.
> Right. I can find several EBCDIC variants for European languages,
> but nothing that looks Asian. Maybe IBM didn't bother registering
> the wide-EBCDIC variants with IANA.
>> We can (and probably should) add some guidance in the
>> prose of the paper based on IBM documentation, but I'm otherwise unaware
>> of how normative guidance could be provided.
> If we agree that a (hypothetical) Asian EBCDIC big-endian platform
> should return the same value from wide_literal() as a similar
> little-endian platform (and this implies native endianness),
> then I think we're essentially saying that the return value
> of wide_literal() is interpreted as an encoding form, never
> as an encoding scheme. (Which means the IANA list is not really
> applicable.)
I think there is another way to look at this.

Since sizeof(char) is 1, the only encodings that can be used for
ordinary strings are those for which the encoding form and the encoding
scheme are identical; there is no possibility of a differing underlying
representation. We can therefore say that all of literal(),
wide_literal(), environment() and wide_environment(), and therefore
text_encoding in general, identify encoding forms via their associated
encoding scheme as described by IANA (or the programmer/implementer in
the case of text_encoding::id::other or text_encoding::id::unknown).
This perspective avoids the need to special-case ordinary vs. wide
strings. It means that we don't have to distinguish, for example, use of
UTF-16 for ordinary strings where CHAR_BIT is 16 from use of UTF-16 for
wide strings where sizeof(wchar_t) is 1 and CHAR_BIT is 16. I
acknowledge this is only interesting in a theoretical sense; as far as I
know, no implementations use UTF-16 for ordinary strings.

>>> (As a side note, it seems odd that we're keen on using IANA (i.e. a list of
>>> encoding schemes) for wide_literal(), but then we make every effort to read
>>> this as an encoding form.)
>> Indeed, a consequence of trying to balance available specifications,
>> utility, and programmer expectations.
> Maybe we should bite the bullet and simply list the small number
> of permitted return values for wide_literal() (i.e. UTF-16 and UTF-32
> and "other"; possibly UCS-2 and UCS-4), given that IANA is only really
> helpful for the non-wide case.
> (Since there are no wide EBCDIC encodings in IANA, IBM needs to return
> "other" on those platforms anyway.)

In the sizeof(wchar_t) is 1 case, all encodings that are valid for
ordinary strings are also valid for wide strings. Mapping to IANA only
gets complicated when sizeof(wchar_t) is not 1. We could certainly treat
that case as special and limit the IANA mapping for it to other, unknown,
and the UCS/UTF variants. If, at some point, the IANA registry is
expanded to include wide variants, we could relax the restrictions.

>>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, that the underlying byte representation of a wide string literal match an encoding scheme associated with the encoding form as indicated by wide_literal(). For example:
>>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>>>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
>>> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
>>> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
>>> case.
>> Is this strictly a wording concern? Or do you find the design intent to
>> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
>> case, though it doesn't really matter).
> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
> has some excess bits that are unused (i.e. always 0) when used to store
> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
> This is all intended to probe the "object representation" model.

Do you feel that the wording sufficiently covers CHAR_BIT being 16 for
ordinary strings (where excess bits would presumably also need to be 0)?

My intent would be for the excess bits to always be 0 as an (implied)
artifact of requiring the underlying representation to adhere to a valid
IANA registered encoding scheme for the returned encoding form (with our
twist on UTF-16 implying native endianness as opposed to use of a BOM).


Received on 2021-10-19 13:18:00