sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 1 Oct 2021 18:14:12 -0400

On 10/1/21 4:17 PM, Jens Maurer wrote:
> On 01/10/2021 19.40, Tom Honermann via SG16 wrote:
>> * How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?
> My guess is we're specifically discussing the return value of the wide_literal()
> function in the proposal.
Yes.
> None of the three cases below is describing a conforming implementation of (core language) C++
> to start with, so these questions leave me confused as to their applicability to standardizing
> something like P1885.
For the moment, let's assume that we adopt a resolution for D2460R0 that
allows the use of a variable length encoding for the wide literal encoding.
>
> Assuming the core language restrictions are lifted (and the specification
> interactions with C and the wide-character functions from C analyzed):
>
>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.
> UTF16
Why not UTF16LE? (I know why, but I'd like to hear what rationale is
offered.)
>
>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.
> UTF16

This is, of course, the right answer. But I've seen claims in some of
the email threads that the IANA registered encodings correspond to
encoding schemes, in which case each wchar_t element would correspond to
a byte/octet of either the UTF-16BE or UTF-16LE encoding scheme. On the
other hand, the paper states:

> "A registered character encoding is a character encoding form in the
> IANA Character Sets registry."

>
>> o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.
> That was a bit terse. Ok, you mean an implementation that uses wchar_t same size as char
> and puts wide literals in a sequence of byte-sized wchar_t items with UTF-16LE encoding.
Yes.
> Note that code units are NOT a single byte (it's UTF-16, so code units are 16 bits,
> but a byte can be 8 bits in this scenario).
Yes, my bad, a cut and paste bug.
>
> It feels this is a particularly non-conforming implementation, because wchar_t can't
> even hold a UTF-16 code unit (which needs 16-bit for storage). I think the given
> scenario is just out-of-scope for C++.

My intent was that wchar_t values correspond to bytes/octets as encoded
with UTF-16LE here. If the current wchar_t restriction is lifted as
suggested above, I believe this would be conforming and I would expect
wide_literal() to return UTF16LE.

A similar concern can be illustrated with char:

  * Ordinary literal encoding is UTF-16, CHAR_BIT is >= 16, each char
    element is a code unit of the encoding form.
  * Ordinary literal encoding is UTF-16LE, CHAR_BIT is >= 8, each char
    element is a byte of the encoding scheme.

If we identify these as UTF16 and UTF16LE (as we should), then we aren't
being consistent about whether the IANA registered encodings denote
encoding schemes or encoding forms. How do we specify which encodings
denote encoding schemes and which ones denote encoding forms? Neither
the IANA registry nor the referenced RFCs are clear here, particularly
for UTF16. Jens' answers above are the ones that we want, but I don't
think the paper specifies that or provides rationale for it.
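
To make the distinction concrete, here is a minimal sketch of how the
same character would be stored under the two char cases listed above.
The function names are made up for illustration, std::uint16_t and
std::uint8_t merely stand in for a char with CHAR_BIT >= 16 and
CHAR_BIT == 8 respectively, and the input is assumed to be a valid
Unicode scalar value.

```cpp
#include <cstdint>
#include <vector>

// Encoding form (the "UTF16" case): one element per 16-bit code unit.
// Assumes cp is a valid Unicode scalar value (hypothetical helper).
std::vector<std::uint16_t> to_utf16_code_units(char32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // BMP code point: a single code unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary code point: a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}

// Encoding scheme (the "UTF16LE" case): one element per octet,
// low-order octet of each code unit first (hypothetical helper).
std::vector<std::uint8_t> to_utf16le_octets(char32_t cp) {
    std::vector<std::uint8_t> out;
    for (std::uint16_t unit : to_utf16_code_units(cp)) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
        out.push_back(static_cast<std::uint8_t>(unit >> 8));
    }
    return out;
}
```

For U+1F600, the first function yields the two code units 0xD83D 0xDE00,
while the second yields the four octets 0x3D 0xD8 0x00 0xDE.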

To be clear, I believe what we want is:

  * For UTF16, each char or wchar_t element corresponds to a code unit.
  * For UTF16LE and UTF16BE, each char or wchar_t element corresponds to
    a byte/octet.
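
The rule in the list above could be written down roughly as follows.
This is only a sketch of the intent: the enumerations and the
utf16_label function are hypothetical names invented for illustration,
not anything taken from the paper, and the returned strings simply
mirror the identifiers used in this thread.

```cpp
#include <string_view>

// What a single char or wchar_t element holds (hypothetical).
enum class element_kind { code_unit, octet };

// Octet order; only meaningful when elements are octets (hypothetical).
enum class octet_order { little_endian, big_endian };

// Pick the label that matches how the literal's elements are laid out.
constexpr std::string_view utf16_label(element_kind kind, octet_order order) {
    if (kind == element_kind::code_unit) {
        // Encoding form: the order of octets within an element is not
        // observable through the element values.
        return "UTF16";
    }
    // Encoding scheme: each element is one octet of a serialized code unit.
    return order == octet_order::little_endian ? "UTF16LE" : "UTF16BE";
}
```

Under this rule, the architecture's endianness never influences the
answer when each element holds a whole code unit.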

The paper attempts to avoid these questions by stating this is all
implementation-defined, and that is probably fine; I'm asking these
questions more to ensure the paper is clear in intent and wording and to
ensure we're consistent with regard to programmers' expectations.

Tom.

Received on 2021-10-01 17:14:15