Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sun, 3 Oct 2021 10:59:49 +0200
On 03/10/2021 03.23, Tom Honermann wrote:
> On 10/1/21 6:55 PM, Jens Maurer wrote:
>> - There are already standard ways to determine the endianness of the platform,
>> which is (arguably) orthogonal to the question of encoding form.
> Indeed.
> For me, the rationale is different. I expect a programmer to interpret UTF-16 in this context to mean that the elements of a wide string literal correspond to 16-bit code units. The fact that the underlying byte representation also happens to match UTF-16LE is a secondary consideration that is mostly academic (I expect reinterpret_cast to [unsigned] char or std::byte to be of rare use, especially since mutation via those types would lead to UB).

The motivation here was to support calling iconv on (wide) literals.

And iconv does not deal in wchar_t elements, it deals in "char" elements.
(Which, I think, is broken in itself when looking at wchar_t things;
we want unsigned char or std::byte instead.)
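To make the mismatch concrete: POSIX iconv's prototype traffics only in char pointers, so handing it a wide literal means presenting the wchar_t array as its object representation in bytes. A minimal sketch of that byte view (a hypothetical helper, not calling iconv itself; the prototype in the comment is the real POSIX one):

```cpp
#include <cstddef>

// POSIX iconv deals only in char:
//   size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
//                char** outbuf, size_t* outbytesleft);
// So passing a wide literal means viewing the wchar_t array as a
// sequence of bytes. byte_view is a hypothetical helper showing that view.

inline const unsigned char* byte_view(const wchar_t* p) {
    // Reading the object representation through unsigned char (or
    // std::byte) is permitted; mutating through it would be the UB case
    // mentioned above.
    return reinterpret_cast<const unsigned char*>(p);
}
```

Where exactly the code unit value lands within each wchar_t-sized group of bytes is precisely the endianness question, which the wchar_t-level view never has to answer.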

> What I'm getting at is that there are at least three distinct ways in which wide strings may be encoded in a form that purports to be a variant of UTF-16, but IANA only gives us two identifiers to differentiate them.
> For this case, each wchar_t object would store a single byte (not a code unit, not a code point) of the little endian serialized form of UTF-16. Hence, an 8-bit wchar_t would suffice.

And I keep telling you such an implementation is core-language
non-conforming even with the relaxation that wchar_t
represents a code unit (not a code point).

"String literal objects are initialized with the sequence of code unit
values corresponding to the string-literal's sequence of s-chars (for a
non-raw string literal) and r-chars (for a raw string literal) in order
as follows:"

Note "code unit values". That also means that UTF-16LE and UTF-16BE
in the IANA table are irrelevant, because the integer value we're looking
at is not a byte, but a wchar_t.

And again, UTF-16 has 16-bit code units.
If you want your own encoding that does 8-bit code units based on UTF-16BE,
that's fine, but it is an encoding different from UTF-16(BE/LE).
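The "code unit values" point is easiest to see with char16_t literals, where the standard pins the literal encoding to UTF-16 (wchar_t's encoding is implementation-defined, so a sketch with wchar_t could not be portable). Each array element is a 16-bit code unit; a non-BMP character occupies two elements (a surrogate pair), never a run of bytes:

```cpp
// [lex.string]: string literal objects are initialized with code unit
// values. For a char16_t literal those are UTF-16 code units, so a
// non-BMP scalar value yields a surrogate pair -- two 16-bit elements.

constexpr char16_t gclef[] = u"\U0001D11E"; // MUSICAL SYMBOL G CLEF

static_assert(sizeof(gclef) / sizeof(char16_t) == 3, // 2 code units + NUL
              "non-BMP scalar value encodes as a surrogate pair");
static_assert(gclef[0] == 0xD834 && gclef[1] == 0xDD1E,
              "elements are UTF-16 code unit values, not bytes");
```

An 8-bit element type could not hold these values at all, which is the conformance point being made above.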

> Ok, a bit of a tangent/rant here. I think the Unicode distinction between encoding schemes and encoding forms is overly academic, not useful for design purposes, and actively complicates terminology and discussion of encodings.

For the purposes of the C++ standard, the distinction is very useful, because
wchar_t initialization only deals in encoding forms (there is no sequence of
bytes here).

iconv is only interested in encoding schemes, though, because it deals
in byte sequences.
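The division of labor can be sketched directly: literal initialization stops at the encoding form (code units), and only a serialization step -- the encoding scheme -- introduces bytes and byte order. The two functions below are hypothetical serializers making that extra step explicit:

```cpp
#include <array>

// Encoding *form*: scalar values -> code units (here, 16-bit units).
// Encoding *scheme*: code units -> bytes; this is where byte order enters.
// Hypothetical serializers for one UTF-16 code unit:

std::array<unsigned char, 2> to_utf16le(char16_t u) {
    return { static_cast<unsigned char>(u & 0xFF),   // low byte first
             static_cast<unsigned char>(u >> 8) };
}

std::array<unsigned char, 2> to_utf16be(char16_t u) {
    return { static_cast<unsigned char>(u >> 8),     // high byte first
             static_cast<unsigned char>(u & 0xFF) };
}
```

The same code unit value maps to two different byte sequences, which is exactly why IANA needs UTF-16LE and UTF-16BE labels at the scheme level while the form level needs no such distinction.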

>> is supposed to (always) return encoding names that fully specify the width and
>> endianness, so UTF16 would never be returned, but just UTF16BE and UTF16LE.
>> For UCS-4, we'd need to invent UCS4LE and UCS4BE and UCS4VAX.
>> This would more directly map to the expected use-case calling iconv,
>> which always takes a sequence of bytes.
> Right, and that approach leads to ambiguity with regard to what value a wchar_t object denotes since the answer depends on sizeof(wchar_t).

I disagree; see above.


Received on 2021-10-03 03:59:56