Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 20 Oct 2021 00:19:58 +0200
On 19/10/2021 23.09, Corentin wrote:
> On Tue, Oct 19, 2021 at 10:38 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>
> Essentially agreed with your points;
> I notice that talking about "encoding form"
> (not encoding scheme and object representation)
> answers a lot of the questions "naturally" and
> seems to do what we want
>
>
> This contradicts previous polls and conclusions that IANA describes encoding schemes,
> and that users only care about encoding schemes.

For ordinary strings, encoding forms and encoding schemes are indistinguishable.
For wide strings, we're investing a lot of specification effort to level out
the details of encoding scheme to essentially get back to encoding form.
I'm arguing we should be doing this for all wide encodings, not just for
UTF-16/32 in particular, or restrict the return value to UTF-16/32 to dodge
the question.
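
To illustrate the distinction with UTF-16, here is a minimal sketch. The
function names utf16_code_units and utf16_octets are hypothetical and are
not part of P1885 or the standard library. The first function works at the
level of the encoding form (code units, no byte order); the second
serializes those code units into octets, which is where the encoding scheme
(UTF-16BE vs. UTF-16LE) comes in. Octets are stored in unsigned char with
values limited to 0..255, which is an assumption made so the sketch does not
depend on CHAR_BIT == 8.

```cpp
#include <cassert>
#include <vector>

// Encoding form: map one code point to UTF-16 code units.
// Byte order plays no role at this level.
std::vector<char16_t> utf16_code_units(char32_t cp) {
    // Precondition: cp is a Unicode scalar value (not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<char16_t>(cp)};
    }
    cp -= 0x10000;
    return {static_cast<char16_t>(0xD800 + (cp >> 10)),     // high surrogate
            static_cast<char16_t>(0xDC00 + (cp & 0x3FF))};  // low surrogate
}

enum class ByteOrder { big, little };

// Encoding scheme: serialize code units into octets in a given byte order.
std::vector<unsigned char> utf16_octets(const std::vector<char16_t>& units,
                                        ByteOrder order) {
    std::vector<unsigned char> out;
    for (char16_t u : units) {
        const unsigned char hi = static_cast<unsigned char>((u >> 8) & 0xFF);
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        if (order == ByteOrder::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

For U+1F600 the code units are the same in both cases (0xD83D, 0xDE00);
only the octet sequences differ between the two byte orders. A wide string
in memory is naturally described by the first function, not the second.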

> Cf. the SG16 minutes of the previous meeting.
> "Encoding form" only applies to UTF encodings anyway.

I think the concept does apply to non-UTF wide encodings as well,
but we have very little information about those.

> I will not change this wording, given past polls.

There might be new information that warrants asking the poll
questions again.

> And again, we have no experience with text handling on platforms that have CHAR_BIT != 8,
> so we can either have it return unknown, or leave it to implementers to decide whether their string
> encodings match a registered encoding (by saying nothing, which is the status quo),
> rather than trying to force a definition that will not match standard practice (which would then force implementers to return unknown anyway).

Ideally, we should strive to formulate the general guidance so that
we get at least a plausible outcome for CHAR_BIT > 8.
Talking about code units (= encoding forms) seems to make that easier
and avoids any additional questions when byte != octet.

Jens

Received on 2021-10-19 17:20:08