sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 6 Oct 2021 17:25:38 +0200

On Wed, Oct 6, 2021 at 5:12 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 10/6/21 10:53 AM, Jens Maurer wrote:
>
> On 06/10/2021 16.42, Corentin Jabot wrote:
>
> On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>
> I'm trying to understand how the IANA table, the specific values in that table,
> the encodings those values represent, the use of "encoding form" vs. "encoding
> scheme", and the use of integers (not octets) to initialize wchar_t's all fit
> together. So far, there is friction that we need to resolve, in my view.
>
>
> There is wording that Hubert asks for that says that how these things relate is implementation defined.
>
> And I think that's not helpful for portable code.
>
>
> A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for wide functions (if one exists). The byte order of wide string literals is platform specific and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
> IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
> And "encoding form" vs. "encoding scheme" is Unicode specific.
>
> The question of "encoding form" vs. "encoding scheme" arises for any
> wchar_t encoding in the context of the IANA table, but there appear
> to be very few encodings specified as integers as opposed to
> sequences of bytes.
>
> I'm curious how wchar_t is treated in a non-Unicode situation.
> Even something like Big5 https://en.wikipedia.org/wiki/Big5
> appears to be byte-based, not integer-based:
>
> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
>
> So, it seems to be a multibyte encoding, not a wide one.
>
> How do you distinguish a multibyte encoding from a wide one? Is it solely
> based on the current language in the standard ([basic.fundamental]p8
> <http://eel.is/c++draft/basic.fundamental#8>) that requires that "The
> values of type wchar_t can represent distinct codes for all members of the
> largest extended character set specified among the supported locales
> ([locale])."
>
They are orthogonal notions.

Multibyte encoding is a misnomer for encodings that use a variable number
of bytes per character (UTF-8). Double-byte encodings are encodings in
which every character is encoded in a number of bytes that is a multiple
of 2 (Big5, GB 2312, UTF-16), although the terminology and exact
definitions vary.
In any case, the distinction is the smallest indivisible unit of
information.
Wide encodings are a C++ invention whose definition depends on the size of
wchar_t.
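
As a rough illustration of "smallest indivisible unit", here is a minimal
sketch that counts code units rather than bytes or characters. The helper
names code_unit_bits and code_units are made up for this example and are
not part of P1885 or of any library; the result for a wchar_t literal is
deliberately left unasserted because it depends on the platform.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: width in bits of one code unit of type CharT.
template <class CharT>
constexpr std::size_t code_unit_bits() {
    return sizeof(CharT) * CHAR_BIT;
}

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same character (U+1F600) needs a different number of code units
// depending on the code unit type, independent of any byte order.
constexpr std::size_t utf16_units = code_units(u"\U0001F600");  // surrogate pair
constexpr std::size_t utf32_units = code_units(U"\U0001F600");  // single unit

// For wchar_t the count is platform specific: it depends on
// sizeof(wchar_t) and on the wide literal encoding the implementation uses.
constexpr std::size_t wide_units = code_units(L"\U0001F600");
constexpr std::size_t wide_bits = code_unit_bits<wchar_t>();
```

Nothing here looks at octets: the code units are integer values of the
element type, which is the "encoding form" side of the question.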

> If we lift that restriction, then I don't see a reason that a multibyte
> encoding would not qualify as a wide encoding; particularly in the case
> where sizeof(wchar_t) == 1.
>
> The C++ specification and implementations produce and have expectations about strings.
> If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
> I'm really struggling to see where the contention is here.
>
> The contention is that [lex.string] initializes wchar_t's with
> (potentially large) integer values (which I understand to be
> "encoding forms" in Unicode parlance), but the RFC accompanying
> the IANA table says the encodings described there are octet-based
> encodings, which I understand to be "encoding schemes" in
> Unicode parlance.
>
> This matches my understanding as well, but nothing prevents the
> (potentially large) integer values vs octet distinction from applying to
> char as well when CHAR_BIT is suitably large.
>
> Tom.
>

Received on 2021-10-06 10:25:52