Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 19 Oct 2021 16:10:15 -0400
On 10/19/21 3:07 PM, Jens Maurer via Lib-Ext wrote:
>>>>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
>>>>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>>>>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>>>>>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
>>>>> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
>>>>> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
>>>>> case.
>>>> Is this strictly a wording concern? Or do you find the design intent to
>>>> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
>>>> case, though it doesn't really matter).
>>> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
>>> has some excess bits that are unused (i.e. always 0) when used to store
>>> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
>>> This is all intended to probe the "object representation" model.
>> Do you feel that the wording sufficiently covers CHAR_BIT being 16 for
>> ordinary strings (where excess bits would presumably also need to be 0)?
> IANA talks about encoding schemes, which refers to octets, not bytes.
> But C++ deals in bytes, not octets.
> With CHAR_BIT = 16 and UTF-16, I can imagine two possible layouts:
> One that puts a UTF-16 code unit into each char, and one that puts
> the octets of (e.g.) UTF-16BE into consecutive chars, so that two
> chars form a UTF-16 code unit and each char stores a value <= 255.

I think we previously determined that the latter runs afoul of
[lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> because there
would be 0-valued elements that do not correspond to the null character
(this effectively corresponds to a multibyte encoding in which trailing
code units may have 0 values).
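
To make the two layouts concrete, here is a minimal sketch that simulates a
CHAR_BIT == 16 "char" with std::uint16_t on an ordinary platform. The names
char16bit, one_code_unit_per_char, one_octet_per_char_be, and
count_zero_elements are hypothetical and exist only for illustration; this is
not how any particular implementation is known to behave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for `char` on a CHAR_BIT == 16 platform.
using char16bit = std::uint16_t;

// First layout: one UTF-16 code unit stored in each 16-bit char.
std::vector<char16bit> one_code_unit_per_char(const std::u16string& s) {
    return std::vector<char16bit>(s.begin(), s.end());
}

// Second layout: the octets of UTF-16BE stored in consecutive chars,
// so two chars form one code unit and each char holds a value <= 255.
std::vector<char16bit> one_octet_per_char_be(const std::u16string& s) {
    std::vector<char16bit> out;
    out.reserve(s.size() * 2);
    for (char16_t cu : s) {
        out.push_back(static_cast<char16bit>(cu >> 8));    // high octet first
        out.push_back(static_cast<char16bit>(cu & 0xFF));  // then low octet
    }
    return out;
}

// Counts zero-valued elements in the string body (no terminator is stored here).
std::size_t count_zero_elements(const std::vector<char16bit>& v) {
    std::size_t n = 0;
    for (char16bit c : v) {
        if (c == 0) ++n;
    }
    return n;
}
```

For u"text", the first layout contains no zero-valued elements, while the
second contains one per code unit, none of which is the null character.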

If we take the perspective that what is returned indicates an encoding
form, then an implementation that does the second thing would have to
return other or unknown.

> Files are essentially sequences of bytes (chars), so when reading
> an external UTF-16 file, I actually expect the second layout to
> appear, even though that means every second octet in memory
> is 0 (although you can't really observe that octet in isolation).

I don't have sufficient experience with CHAR_BIT = 16 implementations to
know what they actually do, but my intuition is that bytes in files
would be mapped to bytes in memory as you indicate. I don't know whether
to expect UTF-16 files on such an implementation to map code units to
bytes or octets to bytes.

> From another angle, how would you expect UTF-8 to be represented
> in chars on a CHAR_BIT = 16 platform? Put two UTF-8 code units
> into a single char in some (which?) byte order (because we have the
> space for it)?
I think I can side-step the question by stating that an implementation
that does the latter would have to return other or unknown because
accessing the string would not yield values corresponding to the
encoding form.
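
As a rough illustration of why element access stops yielding code units, the
sketch below packs two UTF-8 code units into one simulated 16-bit char. The
type char16bit, the helper names, and the choice to put the first code unit in
the high bits are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for `char` on a CHAR_BIT == 16 platform.
using char16bit = std::uint16_t;

// Packs two UTF-8 code units per element, first unit in the high 8 bits
// (an arbitrary choice; the opposite order is equally plausible).
std::vector<char16bit> pack_two_per_char(const std::string& utf8) {
    std::vector<char16bit> out;
    for (std::size_t i = 0; i < utf8.size(); i += 2) {
        unsigned hi = static_cast<unsigned char>(utf8[i]);
        unsigned lo = (i + 1 < utf8.size())
                          ? static_cast<unsigned char>(utf8[i + 1])
                          : 0u;  // pad an odd trailing unit with zero bits
        out.push_back(static_cast<char16bit>((hi << 8) | lo));
    }
    return out;
}

// Necessary (not sufficient) condition for the elements to be UTF-8 code
// units: every element value must fit in 8 bits.
bool elements_fit_utf8_code_units(const std::vector<char16bit>& v) {
    for (char16bit c : v) {
        if (c > 0xFF) return false;
    }
    return true;
}
```

With the two-octet sequence C3 A9 (U+00E9) as input, the packed form is the
single element 0xC3A9, which is not a UTF-8 code unit value.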
>> My intent would be for the excess bits to always be 0 as an (implied)
>> artifact of requiring the underlying representation to adhere to a valid
>> IANA registered encoding scheme for the returned encoding form (with our
>> twist on UTF-16 implying native endianness as opposed to use of a BOM).
> For the case of CHAR_BIT == 16 and UTF-16 code units split across two chars,
> there is no "native" endianness that determines how to split a UTF-16 code
> unit into consecutive chars.

From the perspective that the IDs correspond to encoding forms, there is
no suitable IANA mapping for that case.
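
For the ordinary CHAR_BIT == 8 case, the sizeof(wchar_t) examples quoted at
the top of this message can be checked with a sketch like the following. It
reinterprets the object representation of a wide string as native-endian
UTF-16 code units and compares the result against the expected UTF-16 text.
The function names are hypothetical, and the sketch assumes sizeof(wchar_t) is
a multiple of sizeof(char16_t).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Copies the bytes of a wide string into a sequence of native-endian
// 16-bit code units, ignoring how wchar_t itself is laid out.
std::u16string bytes_as_utf16(const std::wstring& w) {
    const std::size_t nbytes = w.size() * sizeof(wchar_t);
    std::u16string out(nbytes / sizeof(char16_t), u'\0');
    if (!out.empty()) {
        std::memcpy(&out[0], w.data(), out.size() * sizeof(char16_t));
    }
    return out;
}

// True if the byte representation of `w` is exactly the native-endian
// UTF-16 encoding of `expected`.
bool wide_bytes_match_utf16(const std::wstring& w,
                            const std::u16string& expected) {
    return bytes_as_utf16(w) == expected;
}
```

On an implementation with a 2-byte wchar_t holding UTF-16,
wide_bytes_match_utf16(L"text", u"text") would be expected to hold; with a
4-byte wchar_t, each code point occupies 4 bytes and the comparison fails.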


> Jens
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2021/10/21037.php

Received on 2021-10-19 15:10:20