On 10/1/21 4:17 PM, Jens Maurer wrote:

On 01/10/2021 19.40, Tom Honermann via SG16 wrote:

  * How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?

My guess is we're specifically discussing the return value of the wide_literal()
function in the proposal.

Yes.

None of the three cases below is describing a conforming implementation of (core language) C++
to start with, so these questions leave me confused as to their applicability to standardizing
something like P1885.

For the moment, let's assume that we adopt a resolution for D2460R0 that allows the use of a variable length encoding for the wide literal encoding.
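
To make the "variable length wide literal encoding" assumption concrete, here is a minimal sketch of how a program can observe it today. The names `wide_units_for_u1f600` and `wide_literal_looks_variable_length` are hypothetical and not part of P1885 or D2460R0. The sketch only counts how many wchar_t elements the implementation emits for a single supplementary-plane character.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of wchar_t elements the implementation uses for U+1F600 in a wide
// literal, excluding the terminating null element.
constexpr std::size_t wide_units_for_u1f600 =
    sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;

// Width of one wchar_t element in bits on this implementation.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Hypothetical helper: true when one code point needed more than one wchar_t
// element, as happens when the wide literal encoding is UTF-16.
inline bool wide_literal_looks_variable_length() {
    assert(wide_units_for_u1f600 >= 1);  // a non-empty literal has >= 1 element
    return wide_units_for_u1f600 > 1;
}
```

On an implementation with a 16-bit wchar_t and UTF-16 wide literals I would expect the count to be 2 (a surrogate pair). With a 32-bit wchar_t and UTF-32 I would expect 1.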


Assuming the core language restrictions are lifted (and the specification
interactions with C and the wide-character functions from C analyzed):

      o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.

UTF16

Why not UTF16LE? (I know why, but I'd like to hear what rationale is offered.)

      o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.

UTF16

This is, of course, the right answer. But I've seen claims in some of the email threads that the IANA registered encodings correspond to encoding schemes, in which case each wchar_t element would correspond to a byte/octet of either the UTF-16BE or UTF-16LE encoding scheme. On the other hand, the paper states:

> "A registered character encoding is a character encoding form in the IANA Character Sets registry."

      o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.

That was a bit terse.  Ok, you mean an implementation that uses a wchar_t of the same size as char
and puts wide literals in a sequence of byte-sized wchar_t items with UTF-16LE encoding.

Yes.

Note that code units are NOT a single byte (it's UTF-16, so code units are 16 bits,
but a byte can be 8 bits in this scenario).

Yes, my bad, a cut and paste bug.


It feels like this is a particularly non-conforming implementation, because wchar_t can't
even hold a UTF-16 code unit (which needs 16 bits of storage).  I think the given
scenario is just out-of-scope for C++.

My intent was that wchar_t values correspond to bytes/octets as encoded with UTF-16LE here. If the current wchar_t restriction is lifted as suggested above, I believe this would be conforming and I would expect wide_literal() to return UTF16LE.

A similar concern can be illustrated with char:

  * Ordinary literal encoding is UTF-16, CHAR_BIT is >= 16, each char element is a code unit of the encoding form.
  * Ordinary literal encoding is UTF-16LE, CHAR_BIT is >= 8, each char element is a byte of the encoding scheme.

If we identify these as UTF16 and UTF16LE (as we should), then we aren't being consistent about whether the IANA registered encodings denote encoding schemes or encoding forms. How do we specify which encodings denote encoding schemes and which ones denote encoding forms? Neither the IANA registry nor the referenced RFCs are clear here, particularly for UTF16. Jens' answers above are the ones that we want, but I don't think the paper specifies that or provides a rationale.

To be clear, I believe what we want is:

For UTF16, each char or wchar_t element corresponds to a code unit.
For UTF16LE and UTF16BE, each char or wchar_t element corresponds to a byte/octet.
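
As an illustration of that distinction (a sketch only, not wording from the paper), the following shows the same code point first as UTF-16 code units, which is what I'd expect each element to hold under UTF16, and then as UTF-16LE octets, which is what I'd expect under UTF16LE. The function names `utf16_code_units` and `utf16le_octets` are hypothetical, and `std::uint8_t` assumes an implementation with 8-bit bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encoding form: one code point becomes one or two 16-bit code units.
std::vector<std::uint16_t> utf16_code_units(char32_t cp) {
    // Only Unicode scalar values are encodable (no surrogates, <= U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};
    }
    cp -= 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};  // low surrogate
}

// Encoding scheme (UTF-16LE): each code unit is serialized as two octets,
// least significant octet first.
std::vector<std::uint8_t> utf16le_octets(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));
        out.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return out;
}
```

For U+1F600 this gives two elements in the first case (0xD83D, 0xDE00) and four in the second (0x3D, 0xD8, 0x00, 0xDE). The element count of a literal therefore depends on which interpretation the registered name denotes.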

The paper attempts to avoid these questions by stating that this is all implementation-defined, and that is probably fine. I'm asking these questions more to ensure the paper is clear in intent and wording, and to ensure we're consistent with regard to programmers' expectations.

Tom.