sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 6 Oct 2021 11:24:03 -0400

On 10/6/21 11:05 AM, Corentin Jabot wrote:
>
>
> On Wed, Oct 6, 2021 at 4:53 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 06/10/2021 16.42, Corentin Jabot wrote:
> >
> >
> > On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
>
> > I'm trying to understand how the IANA table, the specific values in that
> > table, the encodings those values represent, the use of "encoding form"
> > vs. "encoding scheme", and the use of integers (not octets) to initialize
> > wchar_t's all fit together. So far, there is friction that we need to
> > resolve, in my view.
> >
> >
> > There is wording that Hubert asks for that says that how these things
> > relate is implementation defined.
>
> And I think that's not helpful for portable code.
>
> > A non-hostile implementation would return a registered encoding that has
> > a code unit size of CHAR_BIT for narrow functions, and a registered
> > encoding that has a code unit size of sizeof(wchar_t) for wide functions
> > (if it exists). The byte order of wide string literals is platform
> > specific and P1885 has no bearing on that. P1885 also does not affect how
> > wchar_t represents values.
> > IANA does not specify a byte order in the general case (merely that there
> > is one), so we are not running afoul of anything.
> > And "encoding form" vs. "encoding scheme" is Unicode specific.
>
> The question of "encoding form" vs. "encoding scheme" arises for any
> wchar_t encoding in the context of the IANA table, but there appear
> to be very few encodings specified as integers as opposed to
> sequences of bytes.
>
>
> More like 0
>
>
> I'm curious how wchar_t is treated in a non-Unicode situation.
> Even something like Big5 <https://en.wikipedia.org/wiki/Big5>
> appears to be byte-based, not integer-based:
>
> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for
> non-user-defined characters)
> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
>
>
> So, it seems to be a multibyte encoding, not a wide one.
>
>
> Sure, because it predates Unicode terminology. But the concept is the
> same.
> A code unit is still 2 bytes; these cannot be split further.
> There is no character in Big5 that is encoded as a single byte.
>
> A UTF-16 code unit is also 2 bytes.
I disagree with that, at least in general. A UTF-16 code unit fits in a
single byte when CHAR_BIT is >= 16.
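
As a rough illustration of that arithmetic (a sketch only; the helper name
bytes_per_utf16_code_unit is made up for this example and takes the byte
width as a parameter so that widths other than the host's can be tried):

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of C++ bytes (each char_bit bits wide)
// needed to hold one 16-bit UTF-16 code unit, rounded up.
constexpr std::size_t bytes_per_utf16_code_unit(std::size_t char_bit) {
    return (16 + char_bit - 1) / char_bit;
}

// With octet-sized bytes, a UTF-16 code unit occupies two bytes.
static_assert(bytes_per_utf16_code_unit(8) == 2, "CHAR_BIT == 8");
// With 16-bit or wider bytes, it fits in a single byte.
static_assert(bytes_per_utf16_code_unit(16) == 1, "CHAR_BIT == 16");
static_assert(bytes_per_utf16_code_unit(32) == 1, "CHAR_BIT == 32");

// On any implementation, char16_t is wide enough for one code unit.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds 16 bits");
```

So "2 bytes" only holds when a byte is an octet, which is the common case
but not a guarantee.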
> wchar_t is suitable to represent any encoding that represents a
> character in N bytes (or a sequence of N bytes), for N =
> sizeof(wchar_t)/CHAR_BIT
Once we lift the restriction in [basic.fundamental]p8
<http://eel.is/c++draft/basic.fundamental#8>, yes.
>
>
> > The C++ specification and implementations produce and have expectations
> > about strings.
> > If the strings produced or the expectations match the description of a
> > given existing known encoding, then this encoding is suitable to label
> > the strings and expectations of the C++ program, otherwise it isn't.
> > I'm really struggling to see where the contention is here.
>
> The contention is that [lex.string] initializes wchar_t's with
> (potentially large) integer values (which I understand to be
> "encoding forms" in Unicode parlance), but the RFC accompanying
> the IANA table says the encodings described there are octet-based
> encodings, which I understand to be "encoding schemes" in
> Unicode parlance.
>
>
> Does the wording suggested by Hubert (of specifying we are talking
> about object representation) address your concern?
> We are talking about initialized strings, not what they have been
> initialized with.

I think the distinction between object representation and sequence of
string elements remains a point of contention. Resolving this will be a
goal of our meeting today.
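
To make the two views concrete, here is a minimal sketch (the function names
element_values and object_bytes are made up for illustration, and this is not
proposed wording). The first function reports the integer values held by the
wchar_t elements; the second reports the bytes those elements occupy, whose
count and order depend on the platform:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// View 1: the sequence of string elements, as integer values.
std::vector<unsigned long> element_values(const wchar_t* s, std::size_t n) {
    std::vector<unsigned long> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<unsigned long>(s[i]));
    }
    return out;
}

// View 2: the object representation, as the bytes the elements occupy.
// The byte count per element is sizeof(wchar_t); the byte order within
// each element is whatever the platform uses.
std::vector<unsigned char> object_bytes(const wchar_t* s, std::size_t n) {
    std::vector<unsigned char> out(n * sizeof(wchar_t));
    if (!out.empty()) {
        std::memcpy(out.data(), s, out.size());
    }
    return out;
}
```

For the same two elements 0x41 and 0x3A9, the first view is identical
everywhere, while the second differs between, say, a little-endian platform
with a 16-bit wchar_t and a big-endian platform with a 32-bit wchar_t.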

Tom.

Received on 2021-10-06 10:24:06