On Sat, Oct 2, 2021 at 12:55 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

On 02/10/2021 00.14, Tom Honermann wrote:
> On 10/1/21 4:17 PM, Jens Maurer wrote:
>> On 01/10/2021 19.40, Tom Honermann via SG16 wrote:
>>> * How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?
>> My guess is we're specifically discussing the return value of the wide_literal()
>> function in the proposal.
> Yes.
>> None of the three cases below is describing a conforming implementation of (core language) C++
>> to start with, so these questions leave me confused as to their applicability to standardizing
>> something like P1885.
> For the moment, let's assume that we adopt a resolution for D2460R0 that allows the use of a variable length encoding for the wide literal encoding.
>> Assuming the core language restrictions are lifted (and the specification
>> interactions with C and the wide-character functions from C analyzed):
>>
>>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.
>> UTF16
> Why not UTF16LE? (I know why, but I'd like to hear what is offered for rationale).

- This is most consistent with (the absence of) differentiation for e.g. UCS-2 and UCS-4
and other similar wide encodings.
- There are already standard ways to determine the endianness of the platform,
which is (arguably) orthogonal to the question of encoding form.

>>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.
>> UTF16
>
> This is, of course, the right answer. But I've seen claims in some of the email threads that the IANA registered encodings correspond to encoding schemes, in which case each wchar_t element would correspond to a byte/octet of either the UTF-16BE or UTF-16LE encoding schemes. On the other hand, the paper states:

>> "A registered character encoding is a character encoding form in the IANA Character Sets registry."

The IANA registry is certainly confused in that it offers all of UTF16 and UTF16LE and UTF16BE
as alternatives. This smells like a category error.

This is particularly so because the other wide encodings listed (e.g. UCS-2 and
UCS-4) show no such differentiation, although the endianness diversity obviously
applies to them, too. (Assuming the resulting byte sequence is the interesting property.)

>>> o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.
>> That was a bit terse. Ok, you mean an implementation that uses wchar_t same size as char
>> and puts wide literals in a sequence of byte-sized wchar_t items with UTF-16LE encoding.
> Yes.
>> Note that code units are NOT a single byte (it's UTF-16, so code units are 16 bits,
>> but a byte can be 8 bits in this scenario).
> Yes, my bad, a cut and paste bug.

... and what is the question you actually wanted to ask?
I still don't get it.

>> It feels this is a particularly non-conforming implementation, because wchar_t can't
>> even hold a UTF-16 code unit (which needs 16 bits of storage). I think the given
>> scenario is just out-of-scope for C++.
>
> My intent was that wchar_t values correspond to bytes/octets as encoded with UTF-16LE here. If the current wchar_t restriction is lifted as suggested above, I believe this would be conforming and I would expect wide_literal() to return UTF16LE.

Even after lifting the restriction on wchar_t, I continue to believe that a single
wchar_t object should be able to hold a single code unit (not: code point) of the
encoding. The code units of UTF-16LE are still 16-bit quantities, so an 8-bit
wchar_t would not be conforming.
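
The requirement above can be phrased as a compile-time check. This is a
minimal sketch, assuming the character type has no padding bits; the names
storage_bits and can_hold_code_unit are invented for illustration.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// Number of bits in the object representation of T
// (assumed here to all be value bits).
template <class T>
constexpr std::size_t storage_bits() {
    return sizeof(T) * CHAR_BIT;
}

// True when one object of type T is wide enough to hold a single
// code unit of the given width (16 for UTF-16, 32 for UTF-32).
template <class T>
constexpr bool can_hold_code_unit(std::size_t code_unit_bits) {
    return storage_bits<T>() >= code_unit_bits;
}
```

On an implementation with CHAR_BIT == 8 and sizeof(wchar_t) == 1,
can_hold_code_unit<wchar_t>(16) is false, which is the scenario I consider
non-conforming.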

Two follow-on thoughts:

- It would seem odd to have a platform that uses one endianness for UTF-16 code units
and another one for the rest of the integers. If we do not admit such a possibility,
we don't ever need UTF16BE or UTF16LE (because the endianness is implied by the
platform endianness).

- The preceding bullet applies to wide_literal() and friends, which exist on a
given platform. When considering files (streams of octets), there is no implied
platform endianness, and the differentiation UTF16LE vs. UTF16BE does make sense.
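
To make the distinction concrete: the same sequence of UTF-16 code units
yields two different octet streams depending on the byte order chosen when
writing it out, which is what the UTF16LE/UTF16BE labels disambiguate. This
is a minimal sketch; byte_order and to_octets are hypothetical names, not
taken from P1885.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class byte_order { little, big };

// Serialize N UTF-16 code units into 2*N octets in the requested order.
// In memory a code unit is just a 16-bit integer; the byte order only
// becomes a property of the data once it is written out as octets.
template <std::size_t N>
constexpr std::array<std::uint8_t, 2 * N>
to_octets(const std::array<char16_t, N>& units, byte_order order) {
    std::array<std::uint8_t, 2 * N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        const auto hi = static_cast<std::uint8_t>((units[i] >> 8) & 0xFF);
        const auto lo = static_cast<std::uint8_t>(units[i] & 0xFF);
        out[2 * i]     = (order == byte_order::little) ? lo : hi;
        out[2 * i + 1] = (order == byte_order::little) ? hi : lo;
    }
    return out;
}
```

For the code units 0x0041 and 0x20AC, the little-endian octets are
41 00 AC 20 and the big-endian octets are 00 41 20 AC.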

> A similar concern can be illustrated with char:
>
> * Ordinary literal encoding is UTF-16, CHAR_BIT is >= 16, each char element is a code unit of the encoding form.
> * Ordinary literal encoding is UTF-16LE, CHAR_BIT is >= 8, each char element is a byte of the encoding scheme.
>
> If we identify these as UTF16 and UTF16LE (as we should),

Again, I disagree. UTF-16LE has 16-bit code units, which don't fit into an 8-bit char,
so this is non-conforming. If you wish to define your own encoding that has 8-bit
code units created by a UTF16LE sequence, feel free to do so and label it
Tom16 or so.

Exactly

> then we aren't being consistent with regard to use of the IANA registered encodings as encoding schemes or encoding forms. How do we specify which encodings denote encoding schemes and which ones denote encoding forms?

And which ones should wide_literal() return?

If you construct a text_encoding object by hand, like text_encoding("utf16"), it denotes an encoding form.

The same is true for the literal functions. The fact that we ALSO know the endianness of the platform makes it an encoding scheme, but that invariant is not maintained or implied by the text_encoding object itself.

Now, UTF16LE/UTF16BE are always encoding schemes, and a conforming implementation can return them if it wants to. Is that useful for users?

> Neither the IANA registry nor the referenced RFCs are clear here, particularly for UTF16. Jens' answers above are the ones that we want, but I don't think the paper specifies that, nor provides rationale.
>
> To be clear, I believe what we want is:
>
> * For UTF16, each char or wchar_t element corresponds to a code unit.
> * For UTF16LE and UTF16BE, each char or wchar_t element corresponds to a byte/octet.

An intermediate stage of discussion with Hubert was that the implementation
is supposed to (always) return encoding names that fully specify the width and
endianness, so UTF16 would never be returned, but just UTF16BE and UTF16LE.
For UCS-4, we'd need to invent UCS4LE and UCS4BE and UCS4VAX.

This would more directly map to the expected use-case calling iconv,
which always takes a sequence of bytes.
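
A sketch of the mapping a caller would otherwise have to perform before
invoking iconv: combine the encoding form with the platform byte order to
get a byte-oriented name. The function scheme_name and the enum
platform_order are hypothetical; that the returned strings are accepted by
the target iconv implementation is an assumption, not something the paper
specifies.

```cpp
#include <string_view>

enum class platform_order { little, big };

// Hypothetical mapping from an encoding-form name plus a byte order to a
// byte-oriented encoding-scheme name suitable for a byte-stream converter.
constexpr std::string_view scheme_name(std::string_view form,
                                       platform_order order) {
    const bool le = (order == platform_order::little);
    if (form == "UTF-16") return le ? "UTF-16LE" : "UTF-16BE";
    if (form == "UTF-32") return le ? "UTF-32LE" : "UTF-32BE";
    return form;  // e.g. UTF-8: single-octet code units, no byte order
}
```

If the implementation returned the fully specified name directly, this step
would move from every caller into the implementation.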

> The paper attempts to avoid these questions by stating this is all implementation-defined and that is probably fine; I'm asking these questions more to ensure the paper is clear in intent and wording and to ensure we're consistent with regard to programmers' expectations.

I understand we can require very little in this area normatively
(except probably the handling of Unicode), but we should nonetheless
agree on and give clear guidance what implementations should do.
Otherwise, we'll just get different return values from different
compilers on the same platform, which helps nobody.

Agreed (as long as we keep that manageable; there is no bottom to the abyss, and we are dangerously close to falling into it).

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16