sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 6 Oct 2021 17:36:06 +0200

On 06/10/2021 17.05, Corentin Jabot wrote:
>
>
> On Wed, Oct 6, 2021 at 4:53 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 06/10/2021 16.42, Corentin Jabot wrote:
> >
> >
> > On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
>
> > I'm trying to understand how the IANA table, the specific values in that table,
> > the encodings those values represent, the use of "encoding form" vs. "encoding
> > scheme", and the use of integers (not octets) to initialize wchar_t's all fit
> > together. So far, there is friction that we need to resolve, in my view.
> >
> >
> > Hubert has asked for wording saying that how these things relate is implementation-defined.
>
> And I think that's not helpful for portable code.
>
> > A non-hostile implementation would return a registered encoding that has a code unit size of CHAR_BIT for narrow functions, and a registered encoding that has a code unit size of sizeof(wchar_t) for wide functions (if one exists). The byte order of wide string literals is platform specific and P1885 has no bearing on that. P1885 also does not affect how wchar_t represents values.
> > IANA does not specify a byte order in the general case (merely that there is one), so we are not running afoul of anything.
> > And "encoding form" vs. "encoding scheme" is Unicode specific.
>
> The question of "encoding form" vs. "encoding scheme" arises for any
> wchar_t encoding in the context of the IANA table, but there appear
> to be very few encodings specified as integers as opposed to
> sequences of bytes.
>
>
> More like 0

Good to know.

> I'm curious how wchar_t is treated in a non-Unicode situation.
> Even something like Big5 <https://en.wikipedia.org/wiki/Big5>
> appears to be byte-based, not integer-based:
>
> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
>
>
> So, it seems to be a multibyte encoding, not a wide one.
>
>
> Sure, because it predates Unicode terminology. But the concept is the same.
> A code unit is still 2 bytes; these things cannot be split further. There is no character in Big5 that is encoded as a single byte.

Ok. Assume I have a Big5 wide literal encoding. I'm initializing a wide string
with a well-known character:

wchar_t ws[] = L"<the character>";

What's the value of ws[0]? Note it's an integer value.

Does the value of ws[0] depend on the endianness of my platform?
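
To make the question concrete, the value of ws[0] can be compared with the bytes of its object representation. This is a minimal sketch, not a statement about any particular implementation: it uses L"A" rather than a Big5 character (a Big5 wide literal encoding cannot be shown portably), it assumes wchar_t has no padding bits and that the platform is either little- or big-endian, and the helper names is_little_endian and value_from_bytes are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: detect the byte order of this platform at run time.
bool is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper: rebuild the integer value of a wchar_t from the bytes
// of its object representation, most significant byte first.
// Assumes no padding bits and a non-negative value.
unsigned long long value_from_bytes(wchar_t c) {
    unsigned char bytes[sizeof(wchar_t)];
    std::memcpy(bytes, &c, sizeof(wchar_t));
    const bool little = is_little_endian();
    unsigned long long value = 0;
    for (std::size_t i = 0; i < sizeof(wchar_t); ++i) {
        const std::size_t idx = little ? sizeof(wchar_t) - 1 - i : i;
        value = (value << CHAR_BIT) | bytes[idx];
    }
    return value;
}

// The integer value read through the type is the same on either byte order;
// only the order of the bytes in memory differs.
void demo() {
    const wchar_t ws[] = L"A";
    assert(value_from_bytes(ws[0]) == static_cast<unsigned long long>(ws[0]));
}
```

Under these assumptions the value of ws[0] is fixed by the literal encoding, while the byte sequence seen through memcpy is what varies with endianness.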

> A UTF-16 code unit is also 2 bytes.
> wchar_t is suitable to represent any encoding that represents a character in N bytes (or in a sequence of N-byte units), for N = sizeof(wchar_t)

Agreed.
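
The size relation can be written down as a compile-time check. A minimal sketch, with hypothetical helper names wchar_bits and fits_code_unit; it only compares widths and says nothing about which encoding an implementation actually uses:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t's object representation in bits on this implementation.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical predicate: can one wchar_t hold one code unit of the given
// width in bits? (Ignores possible padding bits.)
constexpr bool fits_code_unit(std::size_t code_unit_bits) {
    return code_unit_bits <= wchar_bits();
}

// sizeof(wchar_t) >= 1, so a code unit of CHAR_BIT bits always fits.
static_assert(fits_code_unit(CHAR_BIT), "wchar_t is at least one byte wide");

// Run-time restatement of the same relation.
void check_widths() {
    assert(wchar_bits() == sizeof(wchar_t) * CHAR_BIT);
    assert(!fits_code_unit(wchar_bits() + 1));
}
```

Whether a 16-bit or 32-bit code unit fits depends on the platform, so those cases are deliberately not asserted here.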

> > The C++ specification and implementations produce and have expectations about strings.
> > If the strings produced or the expectations match the description of a given existing known encoding, then this encoding is suitable to label the strings and expectations of the C++ program, otherwise it isn't.
> > I'm really struggling to see where the contention is here.
>
> The contention is that [lex.string] initializes wchar_t's with
> (potentially large) integer values (which I understand to be
> "encoding forms" in Unicode parlance), but the RFC accompanying
> the IANA table says the encodings described there are octet-based
> encodings, which I understand to be "encoding schemes" in
> Unicode parlance.
>
>
> Does the wording suggested by Hubert (of specifying we are talking about object representation) address your concern?
> We are talking about initialized strings, not what they have been initialized with.

Right, we're talking about the integer sequence that the initialized string
contains. If we're instead talking about the underlying object representation,
I start having questions about padding bits.
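
Whether padding bits are present can be checked per implementation. A minimal sketch with a hypothetical helper padding_bits; it relies on std::numeric_limits<T>::digits counting the non-sign value bits of an integer type:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical helper: number of bits in T's object representation that are
// neither value bits nor the sign bit, i.e. padding bits.
template <typename T>
constexpr std::size_t padding_bits() {
    return sizeof(T) * CHAR_BIT
         - static_cast<std::size_t>(std::numeric_limits<T>::digits)
         - (std::numeric_limits<T>::is_signed ? 1u : 0u);
}

// unsigned char is required to have no padding bits.
static_assert(padding_bits<unsigned char>() == 0,
              "unsigned char has no padding bits");

// For wchar_t the count is implementation-specific; it can only be bounded.
void check_padding() {
    assert(padding_bits<wchar_t>() < sizeof(wchar_t) * CHAR_BIT);
}
```

If padding_bits<wchar_t>() is nonzero on some implementation, the object representation carries bits that the integer value does not determine.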

Jens

Received on 2021-10-06 10:36:17