sg16: Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 20 Oct 2021 11:31:40 +0200

On 20/10/2021 07.06, Corentin wrote:
>
>
> On Wed, Oct 20, 2021, 00:20 Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 19/10/2021 23.09, Corentin wrote:
> > On Tue, Oct 19, 2021 at 10:38 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> >
> > Essentially agreed with your points;
> > I notice that talking about "encoding form"
> > (not encoding scheme and object representation)
> > answers a lot of the questions "naturally" and
> > seems to do what we want
> >
> >
> > This contradicts previous polls and the conclusion that IANA describes encoding schemes,
> > and that users only care about encoding schemes.
>
> For ordinary strings, encoding forms and encoding schemes are indistinguishable.
> For wide strings, we're investing a lot of specification effort to level out
> the details of encoding scheme to essentially get back to encoding form.
> I'm arguing we should be doing this for all wide encodings, not just for
> UTF-16/32 in particular, or restrict the return value to UTF-16/32 to dodge
> the question.
>
> > Cf SG-16 minutes of previous meeting.
> > "Encoding form" only applies to UTF encoding anyway.
>
> I think the concept does apply to non-UTF wide encodings as well,
> but we have very little information about those.
>
> > I will not change this wording, given past polls.
>
> There might be new information that warrants asking the poll
> questions again.
>
> > And again, we have no experience with text handling on platforms that have CHAR_BIT != 8,
> > so we can either have it return unknown, or leave it to implementers to decide whether their string
> > encodings match a registered encoding (by saying nothing, which is the status quo),
> > rather than trying to force a definition that will not match standard practice (which would then force implementers to return unknown anyway).
>
> Ideally, we should strive to formulate the general guidance so that
> we get at least a plausible outcome for CHAR_BIT > 8.
> Talking about code units (= encoding forms) seems to make that easier
> and avoids any additional questions when byte != octet.
>
>
> UTF-16 is, however, specified to be 2 octets; the same goes for all double-width encodings. And we established that users only care about representation.

Except that "representation" is bytes in C++, not octets.
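
To make the byte/octet distinction concrete, here is a minimal sketch.
The helper names are made up for illustration; the only things assumed
from the language are CHAR_BIT, sizeof, and char16_t.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Bits in the object representation of one UTF-16 code unit.
// char16_t is only guaranteed to be able to hold 16-bit values.
constexpr std::size_t bits_per_utf16_code_unit() {
    return sizeof(char16_t) * CHAR_BIT;
}

// True only where a byte is an octet and a code unit occupies exactly two.
// With CHAR_BIT == 16, for example, sizeof(char16_t) would be 1.
constexpr bool utf16_code_unit_is_two_octets() {
    return CHAR_BIT == 8 && sizeof(char16_t) == 2;
}

// Bytes (not octets) needed to store n code units in memory.
inline std::size_t bytes_for_code_units(std::size_t n) {
    assert(n <= static_cast<std::size_t>(-1) / sizeof(char16_t)); // no overflow
    return n * sizeof(char16_t);
}

static_assert(bits_per_utf16_code_unit() >= 16,
              "char16_t must be able to hold a UTF-16 code unit");
```

Counting in code units gives the same answer on every implementation;
counting in octets only matches the in-memory layout when
utf16_code_unit_is_two_octets() happens to be true.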

What exactly does iconv do with the encoding "UTF-16"?
According to the source code,
https://sourceware.org/git/?p=glibc.git;a=blob;f=iconvdata/utf-16.c;h=13a2a056b7175c937439cccf110b99f0684e0f9c;hb=HEAD
(line 52), it does perform BOM interpretation
(and appears to assume native endianness otherwise, in violation
of ISO 10646 (which specifies big endian as the default)).
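
For illustration, a rough sketch of what "BOM interpretation" means for
a decoder of the UTF-16 encoding scheme. This is not the glibc code;
decode_utf16_scheme and ByteOrder are hypothetical names, and the
fallback is a parameter so that both the big-endian default and a
native-endian choice can be expressed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ByteOrder { big, little };

// Decode a sequence of octets labelled "UTF-16" into code units.
// A leading BOM (FE FF or FF FE) selects the byte order and is consumed;
// without a BOM, the caller-supplied fallback order is used.
std::vector<char16_t> decode_utf16_scheme(const std::vector<unsigned char>& octets,
                                          ByteOrder fallback = ByteOrder::big) {
    assert(octets.size() % 2 == 0); // sketch: whole code units only
    ByteOrder order = fallback;
    std::size_t pos = 0;
    if (octets.size() >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            order = ByteOrder::big;
            pos = 2;
        } else if (octets[0] == 0xFF && octets[1] == 0xFE) {
            order = ByteOrder::little;
            pos = 2;
        }
    }
    std::vector<char16_t> units;
    for (; pos + 1 < octets.size(); pos += 2) {
        const unsigned first = octets[pos];
        const unsigned second = octets[pos + 1];
        const unsigned value = (order == ByteOrder::big) ? (first << 8) | second
                                                         : (second << 8) | first;
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

Note that the BOM never shows up in the result: a U+FEFF that was meant
as content at the start of the text is silently dropped by a decoder of
this shape.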

It seems we've already partly given up on iconv pluggability,
because we require the user to change UTF-16 to UTF-16LE/BE depending
on the platform endianness to sidestep any BOM interpretation,
for example for the string literal

"\ufeff something"
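
A sketch of what that user-side adjustment could look like. Both
function names are hypothetical, and I am using a u"" literal here so
that the code units are UTF-16 regardless of the wide or ordinary
literal encoding. The byte order is probed at run time from the object
representation of U+FEFF, so the sketch gives up (returns a null
pointer) where a code unit is not exactly two octets.

```cpp
#include <cassert>
#include <cstring>

// Name of the BOM-free encoding scheme that matches the in-memory layout
// of char16_t on this platform, or nullptr if neither applies
// (for example when CHAR_BIT != 8).
inline const char* native_utf16_scheme_name() {
    const char16_t bom = 0xFEFF;
    unsigned char bytes[sizeof bom];
    std::memcpy(bytes, &bom, sizeof bom);
    if (sizeof bom != 2) return nullptr;
    if (bytes[0] == 0xFF && bytes[sizeof bom - 1] == 0xFE) return "UTF-16LE";
    if (bytes[0] == 0xFE && bytes[sizeof bom - 1] == 0xFF) return "UTF-16BE";
    return nullptr;
}

// True if the text starts with U+FEFF, i.e. a converter that is told
// plain "UTF-16" could take the first code unit for a BOM.
inline bool starts_with_bom(const char16_t* text) {
    return text[0] == 0xFEFF;
}
```

With the literal above, starts_with_bom(u"\ufeff something") is true,
which is exactly the case where passing native_utf16_scheme_name()
instead of "UTF-16" to the converter matters.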

Jens

Received on 2021-10-20 04:31:47