Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 19 Oct 2021 21:07:04 +0200
On 19/10/2021 20.17, Tom Honermann wrote:
> On 10/15/21 6:00 PM, Jens Maurer via Lib-Ext wrote:
>> On 15/10/2021 23.25, Tom Honermann wrote:
>>> On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
>>> Yes, but unless I'm mistaken, IANA does not currently specify any wide
>>> EBCDIC encodings.
>> Right. I can find several EBCDIC variants for European languages,
>> but nothing that looks Asian. Maybe IBM didn't bother registering
>> the wide-EBCDIC variants with IANA.
>>
>>> We can (and probably should) add some guidance in the
>>> prose of the paper based on IBM documentation, but I'm otherwise unaware
>>> of how normative guidance could be provided.
>> If we agree that a (hypothetical) Asian EBCDIC big-endian platform
>> should return the same value from wide_literal() as a similar
>> little-endian platform (and this implies native endianness),
>> then I think we're essentially saying that the return value
>> of wide_literal() is interpreted as an encoding form, never
>> as an encoding scheme. (Which means the IANA list is not really
>> applicable.)
> I think there is another way to look at this.
>
> Since sizeof(char) is 1, the only encodings that can be used for
> ordinary strings are those for which the encoding form and the encoding
> scheme are identical; there is no possibility of a differing underlying
> representation. We can therefore say that all of literal(),
> wide_literal(), environment() and wide_environment(), and therefore
> text_encoding in general, identify encoding forms via their associated
> encoding scheme as described by IANA (or the programmer/implementer in
> the case of text_encoding::id::other or text_encoding::id::unknown).
> This perspective avoids the need to special case ordinary vs wide
> strings. It means that we don't have to distinguish, for example, use of
> UTF-16 for ordinary strings where CHAR_BIT is 16 from use of UTF-16 for
> wide strings where sizeof(wchar_t) is 1 and CHAR_BIT is 16. I
> acknowledge this is only interesting in a theoretical sense; as far as I
> know, no implementations use UTF-16 for ordinary strings.
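
For concreteness, a minimal sketch of what that reading implies, assuming the interface proposed in P1885R8 (the four query functions are as named in the quoted text; the header spelling and output format here are assumptions, not a definitive implementation). Each call would name an encoding form:

    #include <text_encoding>
    #include <cstdio>

    int main() {
        // Under the reading above, each of these names an encoding form,
        // regardless of the byte order used in the object representation.
        std::printf("ordinary literal encoding: %s\n",
                    std::text_encoding::literal().name());
        std::printf("wide literal encoding:     %s\n",
                    std::text_encoding::wide_literal().name());
        std::printf("ordinary environment:      %s\n",
                    std::text_encoding::environment().name());
        std::printf("wide environment:          %s\n",
                    std::text_encoding::wide_environment().name());
    }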

The problem is that information on the level of "encoding scheme"
actually conveys more information (i.e. the endianness) than information
on the level of "encoding form", which means we have to jump through
hoops to tell implementers to return UTF16, but not UTF16LE/BE,
exactly for the purpose of erasing the encoding scheme-specific
information. (Adding to that, UTF16 as returned is not actually
the same kind of UTF16 as the one prescribed by IANA / ISO 10646;
cf. BOM and defaulting to big-endian. We sidestep that question by
mechanical value replacement in the current state of the proposal.)

As you observed, for ordinary strings (those expressed via "char"),
encoding form and encoding scheme are identical, so everything would
be good if we consistently talked about encoding forms.
Except that the IANA list is one presenting encoding schemes.
But maybe we can apply some specification-mapping to the IANA list
to strip it down to encoding forms.
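
To make the form-vs-scheme distinction concrete, here is a minimal sketch, again assuming the P1885R8 interface (wide_literal() is still present in that revision; the header spelling is an assumption). On a platform with 2-byte wchar_t, wide_literal() would name the encoding form UTF-16, while the object representation of a wide literal exhibits a particular scheme (UTF-16LE or UTF-16BE) via the native byte order:

    #include <text_encoding>
    #include <bit>
    #include <cstdio>

    int main() {
        // Names the encoding form, e.g. "UTF-16" -- not UTF-16LE/BE.
        std::printf("wide literal encoding form: %s\n",
                    std::text_encoding::wide_literal().name());

        const wchar_t text[] = L"A";   // U+0041
        const auto* bytes = reinterpret_cast<const unsigned char*>(text);
        // With sizeof(wchar_t) == 2: bytes are {0x41, 0x00} on a little-endian
        // platform (UTF-16LE scheme) and {0x00, 0x41} on a big-endian one
        // (UTF-16BE scheme).
        std::printf("first byte: 0x%02X (%s-endian)\n", bytes[0],
                    std::endian::native == std::endian::little ? "little" : "big");
    }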

>>>> (As a side note, it seems odd that we're keen on using IANA (i.e. a list of
>>>> encoding schemes) for wide_literal(), but then we make every effort to read
>>>> this as an encoding form.)
>>> Indeed, a consequence of trying to balance available specifications,
>>> utility, and programmer expectations.
>> Maybe we should bite the bullet and simply list the small number
>> of permitted return values for wide_literal() (i.e. UTF-16 and UTF-32
>> and "other"; possibly UCS-2 and UCS-4), given that IANA is only really
>> helpful for the non-wide case.
>>
>> (Since there are no wide EBCDIC encodings in IANA, IBM needs to return
>> "other" on those platforms anyway.)
>
> In the sizeof(wchar_t) is 1 case, all encodings that are valid for
> ordinary strings are also valid for wide strings. Mapping to IANA only
> gets complicated when sizeof(wchar_t) is not 1.

Yes, but the latter is the 99% case, I guess.

> We could certainly treat
> that case as special and limit IANA mapping to other, unknown, and the
> UCS/UTF variants for it. If, at some point, the IANA registry is
> expanded to include wide variants, we could relax the restrictions.

Yes.
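
A minimal sketch of that restriction (acceptable_wide_result is a hypothetical helper, not part of the proposal, and the enumerator spellings are assumed to follow the paper's IANA-derived naming, e.g. text_encoding::id::UTF16):

    #include <text_encoding>

    // When sizeof(wchar_t) != 1, only the UCS/UTF forms plus "other" and
    // "unknown" would be acceptable results of wide_literal().
    constexpr bool acceptable_wide_result(std::text_encoding::id mib) {
        using id = std::text_encoding::id;
        switch (mib) {
        case id::UTF16: case id::UTF32:
        case id::UCS2:  case id::UCS4:
        case id::other: case id::unknown:
            return true;
        default:
            return false;
        }
    }

    int main() {
        const bool ok = sizeof(wchar_t) == 1
            || acceptable_wide_result(std::text_encoding::wide_literal().mib());
        return ok ? 0 : 1;   // exit status 0 if the restriction holds
    }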

>>>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, that the underlying byte representation of a wide string literal match an encoding scheme associated with the encoding form as indicated by wide_literal(). For example:
>>>>>
>>>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>>>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>>>>>
>>>>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
>>>> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
>>>> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
>>>> case.
>>> Is this strictly a wording concern? Or do you find the design intent to
>>> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
>>> case, though it doesn't really matter).
>> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
>> has some excess bits that are unused (i.e. always 0) when used to store
>> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
>> This is all intended to probe the "object representation" model.
>
> Do you feel that the wording sufficiently covers CHAR_BIT being 16 for
> ordinary strings (where excess bits would presumably also need to be 0)?

IANA talks about encoding schemes, which refer to octets, not bytes.
But C++ deals in bytes, not octets.
With CHAR_BIT = 16 and UTF-16, I can imagine two possible layouts:
One that puts a UTF-16 code unit into each char, and one that puts
the octets of (e.g.) UTF16BE into consecutive chars, so that two
chars form a UTF-16 code unit and each char stores a value <= 255.

Files are essentially sequences of bytes (chars), so when reading
an external UTF-16 file, I actually expect the second layout to
appear, even though that means every second octet in memory
is 0 (although you can't really observe that octet in isolation).
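
A small self-contained sketch of those two layouts, using std::uint16_t as a stand-in for a 16-bit char (since no CHAR_BIT == 16 platform is at hand; the choice of example text is arbitrary):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        // UTF-16 code units for "A" (U+0041) and "é" (U+00E9).
        const char16_t text[] = { u'A', u'\u00E9' };

        // Layout 1: one UTF-16 code unit per 16-bit char.
        std::vector<std::uint16_t> layout1(text, text + 2);

        // Layout 2: the octets of UTF-16BE spread over consecutive 16-bit
        // chars; each char holds a value <= 255, the upper 8 bits stay 0.
        std::vector<std::uint16_t> layout2;
        for (std::uint16_t cu : layout1) {
            layout2.push_back(cu >> 8);     // high octet first (big-endian scheme)
            layout2.push_back(cu & 0xFF);   // low octet
        }

        for (unsigned v : layout1) std::printf("%04X ", v);  // prints: 0041 00E9
        std::printf("\n");
        for (unsigned v : layout2) std::printf("%04X ", v);  // prints: 0000 0041 0000 00E9
        std::printf("\n");
    }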

From another angle, how would you expect UTF-8 to be represented
in chars on a CHAR_BIT = 16 platform? Put two UTF-8 code units
into a single char in some (which?) byte order (because we have the
space for it)?

> My intent would be for the excess bits to always be 0 as an (implied)
> artifact of requiring the underlying representation to adhere to a valid
> IANA registered encoding scheme for the returned encoding form (with our
> twist on UTF-16 implying native endianness as opposed to use of a BOM).

For the case of CHAR_BIT == 16 and UTF-16 code units split across two chars,
there is no "native" endianness that determines how to split a UTF-16 code unit
across consecutive chars.

Jens

Received on 2021-10-19 14:07:13