On 10/6/21 10:53 AM, Jens Maurer wrote:
On 06/10/2021 16.42, Corentin Jabot wrote:

On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:


      
    I'm trying to understand how the IANA table, the specific values in that table,
    the encodings those values represent, the use of "encoding form" vs. "encoding
    scheme", and the use of integers (not octets) to initialize wchar_t's all fit
    together.  So far, there is friction that we need to resolve, in my view.


There is wording that Hubert asks for that says that how these things relate is implementation-defined.
And I think that's not helpful for portable code.

A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for the narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for the wide functions (if one exists). The byte order of wide string literals is platform-specific and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
And "encoding form" vs. "encoding scheme" is Unicode specific.
The question of "encoding form" vs. "encoding scheme" arises for any
wchar_t encoding in the context of the IANA table, but there appear
to be very few encodings specified as integers as opposed to
sequences of bytes.

I'm curious how wchar_t is treated in a non-Unicode situation.
Even something like Big5  https://en.wikipedia.org/wiki/Big5
appears to be byte-based, not integer-based:

First byte ("lead byte") 	0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
Second byte 	0x40 to 0x7e, 0xa1 to 0xfe

So, it seems to be a multibyte encoding, not a wide one.
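To make the byte ranges quoted above concrete, here is a minimal C++ sketch that checks whether two octets form a structurally valid Big5 double-byte pair. The function names are hypothetical and the ranges are taken only from the two lines quoted above; the check does not tell you whether a given pair is actually assigned to a character. pack_big5_pair is purely an illustration of how a two-octet sequence could be viewed as a single integer, and is not a claim about what any implementation does for wchar_t.

```cpp
#include <cassert>
#include <cstdint>

// Lead-byte range quoted above: 0x81 to 0xfe (user-defined characters included).
constexpr bool is_big5_lead(std::uint8_t b) {
    return b >= 0x81 && b <= 0xfe;
}

// Second-byte ranges quoted above: 0x40 to 0x7e, or 0xa1 to 0xfe.
constexpr bool is_big5_trail(std::uint8_t b) {
    return (b >= 0x40 && b <= 0x7e) || (b >= 0xa1 && b <= 0xfe);
}

// Range check only: true if (lead, trail) has the shape of a double-byte pair.
constexpr bool is_big5_pair(std::uint8_t lead, std::uint8_t trail) {
    return is_big5_lead(lead) && is_big5_trail(trail);
}

// Hypothetical illustration: view the two octets as one 16-bit integer,
// with the lead byte in the high-order bits.
inline std::uint16_t pack_big5_pair(std::uint8_t lead, std::uint8_t trail) {
    assert(is_big5_pair(lead, trail));
    return static_cast<std::uint16_t>((lead << 8) | trail);
}
```

With these definitions, is_big5_pair(0xa4, 0x40) holds, while is_big5_pair(0x41, 0x40) does not because 0x41 is outside the lead-byte range.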

How do you distinguish a multibyte encoding from a wide one? Is it solely based on the current language in the standard ([basic.fundamental]p8) that requires that "The values of type wchar_t can represent distinct codes for all members of the largest extended character set specified among the supported locales ([locale])"?

If we lift that restriction, then I don't see a reason that a multibyte encoding would not qualify as a wide encoding, particularly in the case where sizeof(wchar_t) == 1.


The C++ specification and implementations produce and have expectations about strings.
If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
I'm really struggling to see where the contention is here.
The contention is that [lex.string] initializes wchar_t's with
(potentially large) integer values (which I understand to be
"encoding forms" in Unicode parlance), but the RFC accompanying
the IANA table says the encodings described there are octet-based
encodings, which I understand to be "encoding schemes" in
Unicode parlance.
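The distinction can be sketched in C++ as follows. This is only an illustration under stated assumptions: it assumes 32-bit code units (the width of wchar_t varies between implementations), and ByteOrder and to_octets are hypothetical names, not part of P1885 or the standard library. The same integer code unit value yields two different octet sequences depending on the byte order chosen, which is the information an octet-based label carries and an integer-based one does not.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class ByteOrder { Big, Little };

// Serialize 32-bit code unit values (the integer, "encoding form" view)
// into a sequence of octets (the "encoding scheme" view).
std::vector<std::uint8_t> to_octets(const std::vector<std::uint32_t>& units,
                                    ByteOrder order) {
    std::vector<std::uint8_t> out;
    out.reserve(units.size() * 4);
    for (std::uint32_t u : units) {
        for (int i = 0; i < 4; ++i) {
            // Big-endian emits the most significant octet first,
            // little-endian emits the least significant octet first.
            int shift = (order == ByteOrder::Big) ? 8 * (3 - i) : 8 * i;
            out.push_back(static_cast<std::uint8_t>((u >> shift) & 0xffu));
        }
    }
    assert(out.size() == units.size() * 4);
    return out;
}
```

For the single code unit 0x1F600, ByteOrder::Big gives the octets 00 01 F6 00 and ByteOrder::Little gives 00 F6 01 00.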

This matches my understanding as well, but nothing prevents the distinction between (potentially large) integer values and octets from applying to char as well when CHAR_BIT is suitably large.

Tom.