sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 6 Oct 2021 11:12:38 -0400

On 10/6/21 10:53 AM, Jens Maurer wrote:
> On 06/10/2021 16.42, Corentin Jabot wrote:
>>
>> On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>> I'm trying to understand how the IANA table, the specific values in that table,
>> the encodings those values represent, the use of "encoding form" vs. "encoding
>> scheme", and the use of integers (not octets) to initialize wchar_t's all fit
>> together. So far, there is friction that we need to resolve, in my view.
>>
>>
>> There is wording that Hubert asks for that says that how these things relate is implementation defined.
> And I think that's not helpful for portable code.
>
>> A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for wide functions (if it exists). The byte order of wide string literals is platform specific and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
>> IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
>> And "encoding form" vs. "encoding scheme" is Unicode specific.
> The question of "encoding form" vs. "encoding scheme" arises for any
> wchar_t encoding in the context of the IANA table, but there appear
> to be very few encodings specified as integers as opposed to
> sequences of bytes.
>
> I'm curious how wchar_t is treated in a non-Unicode situation.
> Even something like Big5 https://en.wikipedia.org/wiki/Big5
> appears to be byte-based, not integer-based:
>
> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
>
> So, it seems to be a multibyte encoding, not a wide one.

How do you distinguish a multibyte encoding from a wide one? Is it
solely based on the current language in the standard
([basic.fundamental]p8 <http://eel.is/c++draft/basic.fundamental#8>)
that requires that "The values of type wchar_t can represent distinct
codes for all members of the largest extended character set specified
among the supported locales ([locale])"?

If we lift that restriction, then I don't see a reason that a multibyte
encoding would not qualify as a wide encoding, particularly in the case
where sizeof(wchar_t) == 1.
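
To make the distinction concrete, here is a minimal sketch that uses only
the Big5 byte ranges quoted above. The helper names (is_big5_lead,
is_big5_trail, pack_big5, fits_in_wchar_t) are hypothetical and exist only
for illustration, and packing the two bytes as (lead << 8) | trail is an
assumption about how an implementation might choose to form a single wide
value; nothing in the standard or the IANA registry prescribes it.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers; the ranges are the ones quoted from the Big5 description.
constexpr bool is_big5_lead(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

constexpr bool is_big5_trail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

// Assumed packing: one integer per two-byte character (the "wide" view).
constexpr std::uint32_t pack_big5(unsigned char lead, unsigned char trail) {
    return (static_cast<std::uint32_t>(lead) << 8) | trail;
}

// True if a single wchar_t can hold the value. If wchar_t were one byte
// wide, no packed two-byte character would fit, and a "wide" string would
// need two code units per character, just like the multibyte form.
constexpr bool fits_in_wchar_t(std::uint32_t value) {
    return value <= static_cast<unsigned long long>(
                        std::numeric_limits<wchar_t>::max());
}

static_assert(is_big5_lead(0xA4) && is_big5_trail(0x40),
              "0xA4 0x40 lies within the quoted lead/trail ranges");
static_assert(pack_big5(0xA4, 0x40) == 0xA440,
              "packing places the lead byte in the high bits");
```

On an implementation where fits_in_wchar_t(pack_big5(lead, trail)) is false
for valid pairs, the only thing separating the wide form from the multibyte
form would be the [basic.fundamental]p8 requirement itself.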

>
>> The C++ specification and implementations produce and have expectations about strings.
>> If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
>> I'm really struggling to see where the contention is here.
> The contention is that [lex.string] initializes wchar_t's with
> (potentially large) integer values (which I understand to be
> "encoding forms" in Unicode parlance), but the RFC accompanying
> the IANA table says the encodings described there are octet-based
> encodings, which I understand to be "encoding schemes" in
> Unicode parlance.

This matches my understanding as well, but nothing prevents the
(potentially large) integer values vs octet distinction from applying to
char as well when CHAR_BIT is suitably large.
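
As a rough illustration of that distinction, the sketch below serializes
one integer code unit into octets in either byte order. The names ByteOrder
and to_octets are hypothetical, and the function is only a model of the
"encoding form" to "encoding scheme" step, not a description of what any
implementation does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ByteOrder { big, little };  // hypothetical name

// Serialize one integer code unit (the "encoding form" view) into `width`
// octets (the "encoding scheme" view) in the requested byte order.
inline std::vector<std::uint8_t> to_octets(std::uint32_t code_unit,
                                           std::size_t width,
                                           ByteOrder order) {
    std::vector<std::uint8_t> out(width);
    for (std::size_t i = 0; i < width; ++i) {
        // i counts octets starting from the least significant one; octets
        // beyond the size of code_unit are zero (avoids an oversized shift).
        std::uint8_t octet = 0;
        if (i < sizeof(code_unit)) {
            octet = static_cast<std::uint8_t>((code_unit >> (8 * i)) & 0xFF);
        }
        const std::size_t pos =
            (order == ByteOrder::little) ? i : width - 1 - i;
        out[pos] = octet;
    }
    return out;
}
```

For example, to_octets(0x0041, 2, ByteOrder::big) yields the octets 00 41,
while ByteOrder::little yields 41 00: the same integer value, two different
octet sequences. A char with CHAR_BIT == 16 would face the same choice as
soon as its values had to be written out as octets.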

Tom.

Received on 2021-10-06 10:12:42