Yes.
On 01/10/2021 19.40, Tom Honermann via SG16 wrote:
* How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?
My guess is we're specifically discussing the return value of the wide_literal() function in the proposal.
For the moment, let's assume that we adopt a resolution for D2460R0 that allows the use of a variable length encoding for the wide literal encoding.
None of the three cases below is describing a conforming implementation of (core language) C++ to start with, so these questions leave me confused as to their applicability to standardizing something like P1885.
Why not UTF16LE? (I know why, but I'd like to hear what is offered for rationale).
Assuming the core language restrictions are lifted (and the specification interactions with C and the wide-character functions from C analyzed):
o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.
UTF16
o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.
UTF16
This is, of course, the right answer. But I've seen claims in some of the email threads that the IANA registered encodings correspond to encoding schemes, in which case each wchar_t element would correspond to a byte/octet of either the UTF-16BE or UTF-16LE encoding scheme. On the other hand, the paper states:
> "A registered character encoding is a character encoding
> form in the IANA Character Sets registry."
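To make the form/scheme distinction concrete, here is a minimal sketch. It is not taken from the paper; `ByteOrder` and `serialize_utf16` are hypothetical names used only for illustration. The encoding form is the sequence of 16-bit code units, which has no byte order of its own. An encoding scheme appears only once those code units are serialized to octets.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper type: the byte order chosen when serializing.
enum class ByteOrder { Little, Big };

// Encoding form: `units` is a sequence of 16-bit UTF-16 code units.
// Encoding scheme: the returned octets, in UTF-16LE or UTF-16BE order.
std::vector<std::uint8_t> serialize_utf16(std::u16string_view units,
                                          ByteOrder order) {
    std::vector<std::uint8_t> octets;
    octets.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        const auto hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        if (order == ByteOrder::Little) {
            octets.push_back(lo);  // UTF-16LE: low octet first
            octets.push_back(hi);
        } else {
            octets.push_back(hi);  // UTF-16BE: high octet first
            octets.push_back(lo);
        }
    }
    return octets;
}
```

Under this sketch, the same two code units of `u"A\u20AC"` yield two different octet sequences depending on the scheme, while the code unit sequence itself is unchanged.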
Yes.
o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.
That was a bit terse. Ok, you mean an implementation that uses wchar_t same size as char and puts wide literals in a sequence of byte-sized wchar_t items with UTF-16LE encoding.
Yes, my bad, a cut and paste bug.
Note that code units are NOT a single byte (it's UTF-16, so code units are 16 bits, but a byte can be 8 bits in this scenario).
It feels like this is a particularly non-conforming implementation, because wchar_t can't even hold a UTF-16 code unit (which needs 16 bits of storage). I think the given scenario is just out-of-scope for C++.
My intent was that wchar_t values correspond to bytes/octets as encoded with UTF-16LE here. If the current wchar_t restriction is lifted as suggested above, I believe this would be conforming and I would expect wide_literal() to
A similar concern can be illustrated with char:
If we identify these as UTF16 and UTF16LE (as we should), then we aren't being consistent with regard to use of the IANA registered encodings as encoding schemes or encoding forms. How do we specify which encodings denote encoding schemes and which ones denote encoding forms? Neither the IANA registry nor the referenced RFCs are clear here, particularly for UTF16. Jens' answers above are the ones that we want, but I don't think the paper specifies that, nor provides rationale.
To be clear, I believe what we want is:
The paper attempts to avoid these questions by stating this is all implementation-defined and that is probably fine; I'm asking these questions more to ensure the paper is clear in intent and wording and to ensure we're consistent with regard to programmers