Re: [SG16] P1885 polling

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 23 Sep 2021 14:16:08 +0200
On Thu, Sep 23, 2021 at 2:00 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 23/09/2021 13.10, Peter Brett wrote:
> >> -----Original Message-----
> >> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> >> Sent: 23 September 2021 11:18
> >> To: Corentin <corentin.jabot_at_[hidden]>
> >
> >>> "Obviously broken" is a rather big claim in the absence of suitable
> >>> alternatives.
> >>> I believe the use of the IANA registry is motivated by the paper and
> >>> previous polls.
> >>
> >> I think the polls are not sufficiently precise to argue for the case
> >> that what the IANA table describes (by implication) as a narrow encoding
> >> cannot be re-used to designate a wide encoding trivially derived from the
> >> narrow encoding.
> >>
> >
> > This is a good thing.
> >
> > If I:
> >
> > 1. Obtain the wide literal encoding, E, with the P1885 facility
> > 2. Obtain a wide string literal.
> > 3. Memory copy the string literal into a byte array.
> > 4. Ask an external library [1] "is this byte array validly encoded with
> >    this encoding, E".
>
> Ok, then the encoding value does need to represent the narrow/wide
> differentiation, and also the endianness on the platform, because
> a 16-bit wchar_t on a little-endian platform obviously yields
> different byte values than the same 16-bit wchar_t value on a
> big-endian platform.
>
> In particular the latter point is not obvious at all from the normative
> wording, because an implementation can reasonably expect that all the
> encoding specifies is the sequence of numbers in an array of wchar_t
> (that's actually how literal encoding is specified), and not how that
> maps to a sequence of bytes.
>

Yes, that's also one of the reasons I think it's not a great idea to try to
define some kind of "trivial" narrow <-> wide mapping. I would rather
let implementations that do that document it in a way that matches
their platform's behavior.
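
To illustrate the byte-order point above, here is a minimal sketch. It is not
taken from P1885; the helper names are made up for illustration, and it
assumes 8-bit bytes and that std::uint16_t is available. It shows that the
same 16-bit code unit produces two different byte sequences depending on the
byte order, and that copying the object representation with memcpy gives
whichever of the two the platform uses.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: serialize one 16-bit code unit, most significant
// byte first (the UTF-16BE byte order).
inline std::array<unsigned char, 2> to_bytes_be(std::uint16_t unit) {
    return {static_cast<unsigned char>(unit >> 8),
            static_cast<unsigned char>(unit & 0xFF)};
}

// Hypothetical helper: serialize one 16-bit code unit, least significant
// byte first (the UTF-16LE byte order).
inline std::array<unsigned char, 2> to_bytes_le(std::uint16_t unit) {
    return {static_cast<unsigned char>(unit & 0xFF),
            static_cast<unsigned char>(unit >> 8)};
}

// What a memory copy of the code unit yields on the current platform.
inline std::array<unsigned char, 2> to_bytes_native(std::uint16_t unit) {
    std::array<unsigned char, 2> out{};
    std::memcpy(out.data(), &unit, sizeof unit);
    return out;
}

// Usage sketch: U+00E9 as a single 16-bit code unit.
inline void check_byte_orders() {
    const std::uint16_t unit = 0x00E9;
    assert((to_bytes_be(unit) == std::array<unsigned char, 2>{0x00, 0xE9}));
    assert((to_bytes_le(unit) == std::array<unsigned char, 2>{0xE9, 0x00}));
    // On common platforms the native layout is one of the two; which one
    // is not visible from the sequence of code unit values alone.
    assert(to_bytes_native(unit) == to_bytes_be(unit) ||
           to_bytes_native(unit) == to_bytes_le(unit));
}
```

A library that is handed only the bytes needs to be told which of the two
layouts it is looking at, which is the information the byte-array
interpretation expects the encoding value to carry.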

>
> A few more thoughts here:
>
> - The IANA table has UTF-16BE and UTF-16LE, which is consistent with the
> "byte array" interpretation, but it also has UTF-16, which should thus
> never appear as a result value of wide_literal(). I'd suggest making this
> explicit in the wording.
>

UTF-16 signals UTF-16BE or UTF-16LE, depending on platform endianness.
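
As a rough sketch of that reading (the function name is hypothetical and not
part of the paper, and it assumes 8-bit bytes and a platform that is either
little- or big-endian), a consumer could resolve a plain "UTF-16" result to
the byte-order-specific name like this:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: pick the byte-order-specific scheme name that a
// plain "UTF-16" result would stand for on the current platform.
inline std::string resolve_utf16_scheme() {
    const std::uint16_t probe = 0x0102;
    unsigned char bytes[2];
    // Inspect the object representation to see which byte comes first.
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01 ? "UTF-16BE" : "UTF-16LE";
}
```

In C++20 the same decision can be made at compile time by comparing
std::endian::native from <bit> against std::endian::little and
std::endian::big.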


>
> - We already know the endianness of the platform, so having the wide
> encoding represent the platform endianness is redundant.
>

To make that clear:

A registered character encoding is a character encoding scheme in the IANA
Character Sets registry.

>
>
> Jens
>
>

Received on 2021-09-23 07:16:21