Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 16 Oct 2021 00:00:36 +0200
On 15/10/2021 23.25, Tom Honermann wrote:
> On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:

> Yes, but unless I'm mistaken, IANA does not currently specify any wide
> EBCDIC encodings.

Right. I can find several EBCDIC variants for European languages,
but nothing that looks Asian. Maybe IBM didn't bother registering
the wide-EBCDIC variants with IANA.

> We can (and probably should) add some guidance in the
> prose of the paper based on IBM documentation, but I'm otherwise unaware
> of how normative guidance could be provided.

If we agree that a (hypothetical) Asian EBCDIC big-endian platform
should return the same value from wide_literal() as a similar
little-endian platform (and this implies native endianness),
then I think we're essentially saying that the return value
of wide_literal() is interpreted as an encoding form, never
as an encoding scheme. (Which means the IANA list is not really
applicable.)
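
To make the form/scheme distinction concrete, here is a minimal sketch
(not taken from P1885; the helper name to_utf16_bytes is made up for
illustration). The same sequence of UTF-16 code units, i.e. the encoding
form, serializes to two different byte sequences depending on which
encoding scheme (UTF-16BE or UTF-16LE) is chosen:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units (an encoding form)
// into bytes (an encoding scheme). big_endian == true selects UTF-16BE,
// otherwise UTF-16LE.
std::vector<std::uint8_t> to_utf16_bytes(const std::u16string& units,
                                         bool big_endian) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

The code unit values are identical in both cases; only the byte order
differs, and that byte order is what an encoding scheme pins down and an
encoding form leaves to the platform.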

>> (As a side note, it seems odd that we're keen on using IANA (i.e. a list of
>> encoding schemes) for wide_literal(), but then we make every effort to read
>> this as an encoding form.)
>
> Indeed, a consequence of trying to balance available specifications,
> utility, and programmer expectations.

Maybe we should bite the bullet and simply list the small number
of permitted return values for wide_literal() (i.e. UTF-16 and UTF-32
and "other"; possibly UCS-2 and UCS-4), given that IANA is only really
helpful for the non-wide case.

(Since there are no wide EBCDIC encodings in IANA, IBM needs to return
"other" on those platforms anyway.)

>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
>>>
>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code unit being stored across 4 bytes instead of 2).
>>>
>>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
>> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
>> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
>> case.
>
> Is this strictly a wording concern? Or do you find the design intent to
> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
> case, though it doesn't really matter).

I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
has some excess bits that are unused (i.e. always 0) when used to store
octets. (The answer is probably more obvious for CHAR_BIT == 12.)
This is all intended to probe the "object representation" model.
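
For anyone who wants to poke at this on a concrete platform, a minimal
sketch follows (the helper names bits_per_wide_unit, object_bytes, and
unit_at are made up for illustration). It copies the object
representation of a wide string into bytes and rebuilds a code unit from
those bytes in native byte order:

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <vector>

// Bits occupied by one wchar_t object, including any bits that stay 0
// when the code unit only carries an octet.
inline std::size_t bits_per_wide_unit() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Copy the object representation of n wide code units into a byte vector.
std::vector<unsigned char> object_bytes(const wchar_t* s, std::size_t n) {
    std::vector<unsigned char> bytes(n * sizeof(wchar_t));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), s, bytes.size());
    }
    return bytes;
}

// Rebuild code unit i from its bytes. This round-trips only in native
// byte order, which is the "encoding form" reading of wide_literal().
wchar_t unit_at(const std::vector<unsigned char>& bytes, std::size_t i) {
    wchar_t w;
    std::memcpy(&w, bytes.data() + i * sizeof(wchar_t), sizeof(wchar_t));
    return w;
}
```

On a platform with sizeof(wchar_t) == 4 and CHAR_BIT == 8,
object_bytes(L"text", 4) has 16 elements rather than the 8 that
UTF-16BE or UTF-16LE would use for the same text. With
sizeof(wchar_t) == 1 and CHAR_BIT == 16 it would have 4 elements of
16 bits each, which is the case the wording does not seem to address.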

Jens

Received on 2021-10-15 17:00:46