
Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 2 Oct 2021 21:23:30 -0400
On 10/1/21 6:55 PM, Jens Maurer wrote:
> On 02/10/2021 00.14, Tom Honermann wrote:
>> On 10/1/21 4:17 PM, Jens Maurer wrote:
>>> On 01/10/2021 19.40, Tom Honermann via SG16 wrote:
>>>> * How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?
>>> My guess is we're specifically discussing the return value of the wide_literal()
>>> function in the proposal.
>> Yes.
>>> None of the three cases below is describing a conforming implementation of (core language) C++
>>> to start with, so these questions leave me confused as to their applicability to standardizing
>>> something like P1885.
>> For the moment, let's assume that we adopt a resolution for D2460R0 that allows the use of a variable length encoding for the wide literal encoding.
>>> Assuming the core language restrictions are lifted (and the specification
>>> interactions with C and the wide-character functions from C analyzed):
>>>
>>>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.
>>> UTF16
>> Why not UTF16LE? (I know why, but I'd like to hear what is offered for rationale).
> - This is most consistent with (the absence of) differentiation for e.g. UCS-2 and UCS-4
> and other similar wide encodings.
That is interesting; I had not noticed that there is no registered
encoding for the BE and LE variants of these.
> - There are already standard ways to determine the endianness of the platform,
> which is (arguably) orthogonal to the question of encoding form.

Indeed.
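
For reference, the standard facility presumably meant here is std::endian
from <bit> (C++20); a minimal compile-time check, just for illustration:

    #include <bit>

    // Query the platform's native byte order at compile time.
    constexpr bool platform_is_little_endian =
        std::endian::native == std::endian::little;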

For me, the rationale is different. I expect a programmer to interpret
UTF-16 in this context to mean that the elements of a wide string
literal correspond to 16-bit code units. The fact that the underlying
byte representation also happens to match UTF16LE is a secondary
consideration that is mostly academic (I expect reinterpret_cast to
[unsigned] char or std::byte to be of rare use, especially since
mutation via those types would lead to UB).
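
A minimal illustration (assuming sizeof(wchar_t) == 2, CHAR_BIT == 8, and a
little-endian platform, as in the first case above; not portable):

    #include <cstdio>
    #include <cstring>

    int main() {
        const wchar_t s[] = L"A";                       // one UTF-16 code unit: 0x0041
        unsigned char bytes[2];
        std::memcpy(bytes, s, 2);                       // read-only inspection, so no UB
        std::printf("%02X %02X\n", bytes[0], bytes[1]); // "41 00": matches UTF-16LE byte order
    }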

>
>>>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.
>>> UTF16
>> This is, of course, the right answer. But I've seen claims in some of the email threads that the IANA registered encodings correspond to encoding schemes, in which case each wchar_t element would correspond to a byte/octet of either the UTF-16BE or UTF-16LE encoding schemes. On the other hand, the paper states:
>>> "A registered character encoding is a character encoding form in the IANA Character Sets registry."
> The IANA registry is certainly confused in that it offers all of UTF16 and UTF16LE and UTF16BE
> as alternatives. This smells like a category error.
>
> In particular since other wide encodings shown (e.g. UCS-2 and UCS-4) don't show
> such differentiation, although the endianness diversity obviously applies to them,
> too. (Assuming the resulting byte sequence is the interesting property.)
I agree; the IANA registry is not clear about what it specifies.
>
>>>> o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.
>>> That was a bit terse. Ok, you mean an implementation that uses wchar_t same size as char
>>> and puts wide literals in a sequence of byte-sized wchar_t items with UTF-16LE encoding.
>> Yes.
>>> Note that code units are NOT a single byte (it's UTF-16, so code units are 16 bits,
>>> but a byte can be 8 bits in this scenario).
>> Yes, my bad, a cut and paste bug.
> ... and what is the question you actually wanted to ask?
> I still don't get it.

What I'm getting at is that there are at least three distinct ways in
which wide strings may be encoded in a form that purports to be a
variant of UTF-16, but IANA only gives us two identifiers to
differentiate them.
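
For concreteness, here is how the wide string L"A" (U+0041) would be laid out
in each of the three cases (illustrative only, under the assumptions stated
for each case):

    1) UTF-16, sizeof(wchar_t) == 2, CHAR_BIT == 8, little endian:
         wchar_t elements: { 0x0041 }        underlying bytes: 41 00
    2) UTF-16, sizeof(wchar_t) == 1, CHAR_BIT >= 16:
         wchar_t elements: { 0x0041 }        (each element is one code unit)
    3) UTF-16LE, sizeof(wchar_t) == 1, CHAR_BIT == 8:
         wchar_t elements: { 0x41, 0x00 }    (each element is one byte of the encoding scheme)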

>
>>> It feels this is a particularly non-conforming implementation, because wchar_t can't
>>> even hold a UTF-16 code unit (which needs 16-bit for storage). I think the given
>>> scenario is just out-of-scope for C++.
>> My intent was that wchar_t values correspond to bytes/octets as encoded with UTF-16LE here. If the current wchar_t restriction is lifted as suggested above, I believe this would be conforming and I would expect wide_literal() to return UTF16LE.
> Even after lifting the restriction on wchar_t, I continue to believe that a single
> wchar_t object should be able to hold a single code unit (not: code point) of the
> encoding. The code units of UTF-16LE are still 16-bit quantities, so an 8-bit
> wchar_t would not be conforming.

For this case, each wchar_t object would store a single byte (not a code
unit, not a code point) of the little endian serialized form of UTF-16.
Hence, an 8-bit wchar_t would suffice.

> Two follow-on thoughts:
>
> - It would seem odd to have a platform that uses one endianness for UTF-16 code units
> and another one for the rest of the integers. If we do not admit such a possibility,
> we don't ever need UTF16BE or UTF16LE (because the endianness is implied by the
> platform endianness).
That assumes that the elements of the literal encoding are code units,
but I agree otherwise.
>
> - The preceding bullet applies to wide_literal() and friends, which exist on a
> given platform. When considering files (streams of octets), there is no implied
> platform endianness, and the differentiation UTF16LE vs. UTF16BE does make sense.
Indeed, but the distinction equally applies to bytes in memory when
accessed as bytes.
>
>> A similar concern can be illustrated with char:
>>
>> * Ordinary literal encoding is UTF-16, CHAR_BIT is >= 16, each char element is a code unit of the encoding form.
>> * Ordinary literal encoding is UTF-16LE, CHAR_BIT is >= 8, each char element is a byte of the encoding scheme.
>>
>> If we identify these as UTF16 and UTF16LE (as we should),
> Again, I disagree. UTF-16LE has 16-bit code units, which don't fit into an 8-bit char,
> so this is non-conforming. If you wish to define your own encoding that has 8-bit
> code units created by a UTF16LE sequence, feel free to do so and label it
> Tom16 or so.
Ok, a bit of a tangent/rant here. I think the Unicode distinction
between encoding schemes and encoding forms is overly academic, not
useful for design purposes, and actively complicates terminology and
discussion of encodings. When designing text_view
<https://github.com/tahonermann/text_view>, I chose not to distinguish
between them and defined UTF-16 to encode/decode a sequence of 16-bit
elements (corresponding to the code units of the encoding form) and
defined UTF-16BE and UTF-16LE to encode/decode a sequence of 8-bit
elements (corresponding to the bytes/octets of the encoding scheme).
This choice was made because that is what is actually useful; byte-swapped
16-bit code unit values are of no use.
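
A rough sketch of that shape (hypothetical type names, not text_view's actual
interface): the element type alone distinguishes the variants, so no separate
form/scheme notion is needed.

    // Hypothetical sketch; these names are not text_view's actual API.
    struct utf16_encoding   { using code_unit_type = char16_t;      };  // 16-bit elements (encoding form)
    struct utf16le_encoding { using code_unit_type = unsigned char; };  // byte elements, little-endian order
    struct utf16be_encoding { using code_unit_type = unsigned char; };  // byte elements, big-endian order
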
>
>> then we aren't being consistent with regard to use of the IANA registered encodings as encoding schemes or encoding forms. How do we specify which encodings denote encoding schemes and which ones denote encoding forms?
> And which ones should wide_literal() return?
Yes, exactly. From the perspective of intent, how do we express that as
a general rule?
>
>> Neither the IANA registry nor the referenced RFCs are clear here, particularly for UTF16. Jens' answers above are the ones that we want, but I don't think the paper specifies that, nor provides rationale.
>>
>> To be clear, I believe what we want is:
>>
>> * For UTF16, each char or wchar_t element corresponds to a code unit.
>> * For UTF16LE and UTF16BE, each char or wchar_t element corresponds to a byte/octet.
> An intermediate stage of discussion with Hubert was that the implementation
> is supposed to (always) return encoding names that fully specify the width and
> endianness, so UTF16 would never be returned, but just UTF16BE and UTF16LE.
> For UCS-4, we'd need to invent UCS4LE and UCS4BE and UCS4VAX.
>
> This would more directly map to the expected use-case calling iconv,
> which always takes a sequence of bytes.
Right, and that approach leads to ambiguity with regard to what value a
wchar_t object denotes since the answer depends on sizeof(wchar_t).
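
For reference, that iconv use-case would look roughly like this (a sketch
assuming the P1885 interface, std::text_encoding::wide_literal().name(); the
header and names are as proposed and may change):

    #include <iconv.h>
    #include <cstddef>
    #include <text_encoding>  // as proposed in P1885

    // Convert the in-memory bytes of a wide literal to UTF-8. This only does
    // the right thing if wide_literal() names the byte-level (encoding scheme)
    // representation, which is exactly the ambiguity discussed above.
    std::size_t wide_to_utf8(char* out, std::size_t out_size) {
        wchar_t lit[] = L"text";
        char* in = reinterpret_cast<char*>(lit);
        std::size_t in_left = sizeof(lit) - sizeof(wchar_t);  // drop the terminator
        char* out_p = out;
        std::size_t out_left = out_size;
        iconv_t cd = iconv_open("UTF-8", std::text_encoding::wide_literal().name());
        iconv(cd, &in, &in_left, &out_p, &out_left);          // error handling omitted
        iconv_close(cd);
        return out_size - out_left;
    }
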
>> The paper attempts to avoid these questions by stating this is all implementation-defined and that is probably fine; I'm asking these questions more to ensure the paper is clear in intent and wording and to ensure we're consistent with regard to programmers' expectations.
> I understand we can require very little in this area normatively
> (except probably the handling of Unicode), but we should nonetheless
> agree on and give clear guidance what implementations should do.
> Otherwise, we'll just get different return values from different
> compilers on the same platform, which helps nobody.

Yes, that is exactly the issue I want to see addressed.

Tom.


Received on 2021-10-02 20:23:34