On Wed, Oct 6, 2021 at 5:24 PM Tom Honermann <tom@honermann.net> wrote:

On 10/6/21 11:05 AM, Corentin Jabot wrote:

On Wed, Oct 6, 2021 at 4:53 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 06/10/2021 16.42, Corentin Jabot wrote:
>
>
> On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>

> I'm trying to understand how the IANA table, the specific values in that table,
> the encodings those values represent, the use of "encoding form" vs. "encoding
> scheme", and the use of integers (not octets) to initialize wchar_t's all fit
> together. So far, there is friction that we need to resolve, in my view.
>
>
> There is wording that Hubert asks for that says that how these things relate is implementation defined.

And I think that's not helpful for portable code.

> A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for wide functions (if one exists). The byte order of wide string literals is platform-specific, and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
> IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
> And "encoding form" vs. "encoding scheme" is Unicode specific.

The question of "encoding form" vs. "encoding scheme" arises for any
wchar_t encoding in the context of the IANA table, but there appear
to be very few encodings specified as integers as opposed to
sequences of bytes.

More like 0

I'm curious how wchar_t is treated in a non-Unicode situation.
Even something like Big5 https://en.wikipedia.org/wiki/Big5
appears to be byte-based, not integer-based:

First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
Second byte 0x40 to 0x7e, 0xa1 to 0xfe

So, it seems to be a multibyte encoding, not a wide one.
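
To make the byte-based description concrete, here is a minimal C++ sketch of the two ranges quoted above. The function names are hypothetical, and the last helper (packing the two bytes into one 16-bit integer, lead byte first) is only an assumption about what a "wide" view of the same pair could look like; nothing in Big5 or the IANA registry prescribes it.

```cpp
#include <cstdint>

// Lead byte range as quoted above: 0x81 to 0xfe.
constexpr bool is_big5_lead(std::uint8_t b) {
    return b >= 0x81 && b <= 0xFE;
}

// Second byte ranges as quoted above: 0x40 to 0x7e, 0xa1 to 0xfe.
constexpr bool is_big5_trail(std::uint8_t b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

// A two-byte sequence is well-formed only if both bytes are in range.
constexpr bool is_big5_pair(std::uint8_t lead, std::uint8_t trail) {
    return is_big5_lead(lead) && is_big5_trail(trail);
}

// Hypothetical "integer" view: lead byte in the high-order position.
// This choice of order is an assumption made for illustration.
constexpr std::uint16_t big5_pair_as_integer(std::uint8_t lead,
                                             std::uint8_t trail) {
    return static_cast<std::uint16_t>((lead << 8) | trail);
}

static_assert(is_big5_pair(0xA4, 0x40), "both bytes in range");
static_assert(!is_big5_pair(0x41, 0x40), "0x41 is not a lead byte");
static_assert(big5_pair_as_integer(0xA4, 0x40) == 0xA440,
              "lead byte ends up in the high-order position");
```

The first three functions only ever look at bytes, which is the multibyte view; the last one is where a byte order would have to be chosen.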

Sure, because it predates Unicode terminology. But the concept is the same.

A code unit is still 2 bytes; these things cannot be split further. There is no character in Big5 that is encoded as a single byte.

A UTF-16 code unit is also 2 bytes.

I disagree with that, at least in general. A UTF-16 code unit fits in a single byte when CHAR_BIT is >= 16.

Sure? Octet.

wchar_t is suitable to represent any encoding that represents a character in N bytes (or a sequence of N bytes), for N = sizeof(wchar_t)/CHAR_BIT

Once we lift the restriction in [basic.fundamental]p8, yes.
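
The byte-versus-octet point can be sketched in C++ as follows. The helper name bytes_per_code_unit is hypothetical, and the CHAR_BIT values passed in the checks are examples rather than claims about any particular platform. Note also that sizeof(wchar_t) * CHAR_BIT is the size of the object representation, which could in principle include padding bits.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: how many bytes (in the C++ sense, char_bit bits
// each) are needed to hold one code unit that is `bits` bits wide.
constexpr std::size_t bytes_per_code_unit(std::size_t bits,
                                          std::size_t char_bit = CHAR_BIT) {
    return (bits + char_bit - 1) / char_bit;  // round up
}

// A 16-bit code unit takes two bytes when a byte is an octet...
static_assert(bytes_per_code_unit(16, 8) == 2, "two octets");
// ...but a single byte when CHAR_BIT >= 16.
static_assert(bytes_per_code_unit(16, 16) == 1, "one 16-bit byte");
static_assert(bytes_per_code_unit(16, 32) == 1, "one 32-bit byte");

// Size in bits of the object representation of wchar_t on this platform.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;
```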

> The C++ specification and implementations produce and have expectations about strings.
> If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
> I'm really struggling to see where the contention is here.

The contention is that [lex.string] initializes wchar_t's with
(potentially large) integer values (which I understand to be
"encoding forms" in Unicode parlance), but the RFC accompanying
the IANA table says the encodings described there are octet-based
encodings, which I understand to be "encoding schemes" in
Unicode parlance.
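
A minimal C++ sketch of the two views in question, with hypothetical function names: the first returns the integer values of the wchar_t elements, the second returns the bytes of their object representation. Only the second depends on byte order and on sizeof(wchar_t).

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Value view: each wchar_t element as an integer.
std::vector<unsigned long> element_values(const wchar_t* s, std::size_t n) {
    std::vector<unsigned long> out;
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<unsigned long>(s[i]));
    }
    return out;
}

// Byte view: the object representation of the same n elements.
std::vector<unsigned char> object_bytes(const wchar_t* s, std::size_t n) {
    std::vector<unsigned char> out(n * sizeof(wchar_t));
    if (!out.empty()) {
        std::memcpy(out.data(), s, out.size());
    }
    return out;
}
```

For example, assuming a little-endian platform with a 32-bit wchar_t and a Unicode-based wide literal encoding, element_values(L"A", 1) would hold the single value 0x41, while object_bytes(L"A", 1) would hold the four bytes 41 00 00 00. An interface that consumes bytes sees only the latter.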

Does the wording suggested by Hubert (of specifying we are talking about object representation) address your concern?

We are talking about initialized strings, not what they have been initialized with.

I think the distinction between object representation and sequence of string elements remains a point of contention. Resolving this will be a goal of our meeting today.

Please keep in mind that iconv and other interfaces, like QTextDecoder, always convert between sequences of bytes. If that is a use case we think is important, then caring about the value is not enough, and we want to discourage 0 padding.

Tom.