sg16: Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 19 Oct 2021 22:38:18 +0200

Essentially agreed with your points;
I notice that talking about "encoding form"
(not encoding scheme and object representation)
answers a lot of the questions "naturally" and
seems to do what we want.

Jens

On 19/10/2021 22.10, Tom Honermann wrote:
> On 10/19/21 3:07 PM, Jens Maurer via Lib-Ext wrote:
>>>>>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1, the underlying byte representation of a wide string literal matches an encoding scheme associated with the encoding form indicated by wide_literal(). For example:
>>>>>>>
>>>>>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that constitutes valid UTF-16BE or UTF-16LE.
>>>>>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes that is not valid UTF-16BE or UTF-16LE (due to each code point being stored across 4 bytes instead of 2).
>>>>>>>
>>>>>>> It may be that the paper would benefit from some updates to make this more clear, but I don't have any specific suggestions at this time.
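
A minimal sketch of the two cases above, assuming CHAR_BIT == 8. Because sizeof(wchar_t) varies by platform, char16_t and char32_t stand in for a 2-byte and a 4-byte wchar_t; the helper names (object_bytes, decode_utf16, bytes_are_utf16_of) are hypothetical and are not part of P1885. The check copies the object representation into bytes and asks whether those bytes decode, as UTF-16BE or UTF-16LE, to the expected code units.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Copy the object representation of n code units into a byte vector.
template <class Unit>
std::vector<unsigned char> object_bytes(const Unit* s, std::size_t n) {
    std::vector<unsigned char> bytes(n * sizeof(Unit));
    if (!bytes.empty()) std::memcpy(bytes.data(), s, bytes.size());
    return bytes;
}

// Decode a byte sequence as UTF-16BE or UTF-16LE code units.
// Surrogate pairing is not validated; this only regroups octets.
std::u16string decode_utf16(const std::vector<unsigned char>& b, bool big_endian) {
    assert(b.size() % 2 == 0);  // precondition: whole 16-bit units only
    std::u16string out;
    for (std::size_t i = 0; i + 1 < b.size(); i += 2) {
        unsigned hi = big_endian ? b[i] : b[i + 1];
        unsigned lo = big_endian ? b[i + 1] : b[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}

// True if the bytes underlying s[0..n) read as `expected` in either byte order.
template <class Unit>
bool bytes_are_utf16_of(const Unit* s, std::size_t n, const std::u16string& expected) {
    const auto b = object_bytes(s, n);
    return decode_utf16(b, true) == expected || decode_utf16(b, false) == expected;
}
```

With 2-byte units the bytes of "text" pass the check in the native byte order; with 4-byte units each code point carries two extra zero octets, so neither UTF-16BE nor UTF-16LE decoding reproduces the text.
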
>>>>>> I think the wording currently has no guidance for the sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
>>>>>> and whether it is supposed to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
>>>>>> case.
>>>>> Is this strictly a wording concern? Or do you find the design intent to
>>>>> be unclear? (I think you may have intended CHAR_BIT == 8 for the second
>>>>> case, though it doesn't really matter).
>>>> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
>>>> has some excess bits that are unused (i.e. always 0) when used to store
>>>> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
>>>> This is all intended to probe the "object representation" model.
>>> Do you feel that the wording sufficiently covers CHAR_BIT being 16 for
>>> ordinary strings (where excess bits would presumably also need to be 0)?
>> IANA talks about encoding schemes, which refers to octets, not bytes.
>> But C++ deals in bytes, not octets.
>> With CHAR_BIT = 16 and UTF-16, I can imagine two possible layouts:
>> One that puts a UTF-16 code unit into each char, and one that puts
>> the octets of (e.g.) UTF-16BE into consecutive chars, so that two
>> chars form a UTF-16 code unit and each char stores a value <= 255.
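
The two layouts can be sketched on an ordinary platform by letting std::uint16_t stand in for a 16-bit char. This is an illustration only: the alias wide_byte and the function names are hypothetical, and nothing here claims to reflect what a real CHAR_BIT == 16 implementation does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for `char` on a CHAR_BIT == 16 platform.
using wide_byte = std::uint16_t;

// Layout 1: each char holds one whole UTF-16 code unit.
std::vector<wide_byte> layout_unit_per_char(const std::u16string& s) {
    return std::vector<wide_byte>(s.begin(), s.end());
}

// Layout 2: each char holds one octet of the UTF-16BE encoding scheme,
// so two chars form one code unit and every element is <= 255.
std::vector<wide_byte> layout_octet_per_char_be(const std::u16string& s) {
    std::vector<wide_byte> out;
    for (char16_t u : s) {
        out.push_back(static_cast<wide_byte>((u >> 8) & 0xFF));  // high octet
        out.push_back(static_cast<wide_byte>(u & 0xFF));         // low octet
    }
    return out;
}

// Count elements equal to 0 (none of the inputs below contain U+0000).
std::size_t count_zero_elements(const std::vector<wide_byte>& v) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), wide_byte{0}));
}
```

For u"hi", the first layout yields {0x0068, 0x0069} with no zero elements, while the second yields {0x0000, 0x0068, 0x0000, 0x0069}: half of its elements are 0 even though the text contains no null character.
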
>
> I think we previously determined that the latter runs afoul of [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> because there would be 0 valued elements that do not correspond to the null character (this effectively corresponds to a multibyte encoding in which trailing code units may have 0 values).
>
> If we take the perspective that what is returned indicates an encoding form, then an implementation that does the second thing would have to return other or unknown.
>
>> Files are essentially sequences of bytes (chars), so when reading
>> an external UTF-16 file, I actually expect the second layout to
>> appear, even though that means every second octet in memory
>> is 0 (although you can't really observe that octet in isolation).
>
> I don't have sufficient experience with CHAR_BIT = 16 implementations to know what they actually do, but my intuition is that bytes in files would be mapped to bytes in memory as you indicate. I don't know whether to expect UTF-16 files on such an implementation to map code units to bytes or octets to bytes.
>
>> From another angle, how would you expect UTF-8 to be represented
>> in chars on a CHAR_BIT = 16 platform? Put two UTF-8 code units
>> into a single char in some (which?) byte order (because we have the
>> space for it)?
> I think I can side-step the question by stating that an implementation that does the latter would have to return other or unknown because accessing the string would not yield values corresponding to the encoding form.
>>> My intent would be for the excess bits to always be 0 as an (implied)
>>> artifact of requiring the underlying representation to adhere to a valid
>>> IANA registered encoding scheme for the returned encoding form (with our
>>> twist on UTF-16 implying native endianness as opposed to use of a BOM).
>> For the case of CHAR_BIT == 16 and UTF-16 code units split across two chars,
>> there is no "native" endianness that says how to split a UTF-16 code
>> unit across consecutive chars.
>
> From the perspective that the IDs correspond to encoding forms, there is no suitable IANA mapping for that case.
>
> Tom.
>
>> Jens
>> _______________________________________________
>> Lib-Ext mailing list
>> Lib-Ext_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
>> Link to this post: http://lists.isocpp.org/lib-ext/2021/10/21037.php
>
>

Received on 2021-10-19 15:38:24