Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 6 Oct 2021 16:53:16 +0200

On 06/10/2021 16.42, Corentin Jabot wrote:
>
>
> On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>

> I'm trying to understand how the IANA table, the specific values in that table,
> the encodings those values represent, the use of "encoding form" vs. "encoding
> scheme", and the use of integers (not octets) to initialize wchar_t's all fit
> together. So far, there is friction that we need to resolve, in my view.
>
>
> There is wording that Hubert asks for that says that how these things relate is implementation-defined.

And I think that's not helpful for portable code.

> A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for wide functions (if it exists). The byte order of wide string literals is platform specific and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
> IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
> And "encoding form" vs. "encoding scheme" is Unicode specific.

The question of "encoding form" vs. "encoding scheme" arises for any
wchar_t encoding in the context of the IANA table, but there appear
to be very few encodings specified as integers as opposed to
sequences of bytes.

I'm curious how wchar_t is treated in a non-Unicode situation.
Even something like Big5 https://en.wikipedia.org/wiki/Big5
appears to be byte-based, not integer-based:

First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
Second byte 0x40 to 0x7e, 0xa1 to 0xfe

So, it seems to be a multibyte encoding, not a wide one.
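
To make the byte ranges concrete, here is a minimal C++ sketch of a check on a
two-byte Big5 sequence. The helper names (is_big5_lead, is_big5_trail,
is_big5_pair) are made up for illustration, and the ranges are only the ones
quoted above from the Wikipedia page; this is not a full Big5 validator.

```cpp
#include <cstdint>

// Hypothetical helpers; ranges are the ones quoted above, nothing more.

// Lead byte: 0x81 to 0xfe (includes the user-defined area).
constexpr bool is_big5_lead(std::uint8_t b) {
    return b >= 0x81 && b <= 0xfe;
}

// Second byte: 0x40 to 0x7e, or 0xa1 to 0xfe.
constexpr bool is_big5_trail(std::uint8_t b) {
    return (b >= 0x40 && b <= 0x7e) || (b >= 0xa1 && b <= 0xfe);
}

// A two-byte sequence is plausible Big5 if both bytes are in range.
constexpr bool is_big5_pair(std::uint8_t lead, std::uint8_t trail) {
    return is_big5_lead(lead) && is_big5_trail(trail);
}
```

The check operates on individual bytes in sequence, which is what I mean by
"byte-based": there is no single integer value per character that one would
store in a wchar_t.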

> The C++ specification and implementations produce and have expectations about strings.
> If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
> I'm really struggling to see where the contention is here.

The contention is that [lex.string] initializes wchar_t's with
(potentially large) integer values (which I understand to be
"encoding forms" in Unicode parlance), but the RFC accompanying
the IANA table says the encodings described there are octet-based
encodings, which I understand to be "encoding schemes" in
Unicode parlance.
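
A small sketch of the distinction as I understand it. I use char16_t and
UTF-16 only because the code unit width is fixed there (the width of wchar_t
varies by platform). The function names to_utf16be and to_utf16le are
hypothetical; nothing here comes from P1885 or the RFC.

```cpp
#include <array>
#include <cstdint>

// Encoding form: a UTF-16 code unit is an integer value (here a char16_t).
// Encoding scheme: that integer serialized to octets in a chosen byte order.

// Hypothetical helper: serialize one code unit as UTF-16BE octets.
constexpr std::array<std::uint8_t, 2> to_utf16be(char16_t cu) {
    return {{static_cast<std::uint8_t>(cu >> 8),
             static_cast<std::uint8_t>(cu & 0xff)}};
}

// Hypothetical helper: serialize one code unit as UTF-16LE octets.
constexpr std::array<std::uint8_t, 2> to_utf16le(char16_t cu) {
    return {{static_cast<std::uint8_t>(cu & 0xff),
             static_cast<std::uint8_t>(cu >> 8)}};
}
```

The same integer value 0x00E9 yields the octets 00 E9 in one scheme and E9 00
in the other; a string literal only fixes the integer, not the octet sequence.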

Jens

Received on 2021-10-06 09:53:28