On Wed, Oct 20, 2021, 00:20 Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 19/10/2021 23.09, Corentin wrote:
> On Tue, Oct 19, 2021 at 10:38 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>
>
>     Essentially agreed with your points;
>     I notice that talking about "encoding form"
>     (not encoding scheme and object representation)
>     answers a lot of the questions "naturally" and
>     seems to do what we want
>
>
> This contradicts previous polls and conclusions that IANA describes encoding schemes,
> and that users only care about encoding schemes.

For ordinary strings, encoding forms and encoding schemes are indistinguishable.
For wide strings, we're investing a lot of specification effort to level out
the details of encoding scheme to essentially get back to encoding form.
I'm arguing we should be doing this for all wide encodings, not just for
UTF-16/32 in particular, or restrict the return value to UTF-16/32 to dodge
the question.
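
To illustrate the distinction being discussed, here is a minimal C++ sketch, assuming UTF-16 and 8-bit octets, in which one sequence of code units (the encoding form) maps to two different octet sequences (the UTF-16LE and UTF-16BE encoding schemes). The names `byte_order` and `serialize_utf16` are hypothetical and do not come from any proposal or library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only: turns UTF-16 code units
// (encoding form) into octets (encoding scheme) in the requested byte order.
enum class byte_order { little, big };

std::vector<std::uint8_t> serialize_utf16(const std::u16string& units,
                                          byte_order order) {
    std::vector<std::uint8_t> octets;
    octets.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        const auto hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        if (order == byte_order::little) {
            octets.push_back(lo);  // UTF-16LE: low octet first
            octets.push_back(hi);
        } else {
            octets.push_back(hi);  // UTF-16BE: high octet first
            octets.push_back(lo);
        }
    }
    return octets;
}
```

For the two code units U+00E9 U+0041, the little-endian serialization yields the octets E9 00 41 00 and the big-endian one 00 E9 00 41, while the code unit sequence itself is the same in both cases.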

>   Cf. the SG16 minutes of the previous meeting.
> "Encoding form" only applies to UTF encodings anyway.

I think the concept does apply to non-UTF wide encodings as well,
but we have very little information about those.

> I will not change this wording, given past polls.

There might be new information that warrants asking the poll
questions again.

> And again, we have no experience with text handling on platforms that have CHAR_BIT != 8,
> so we can either have it return unknown, or leave it to implementers to decide whether their string
> encodings match those of a registered encoding (by saying nothing, which is the status quo),
> rather than trying to force a definition that will not match standard practice (which would then force implementers to return unknown anyway).

Ideally, we should strive to formulate the general guidance so that
we get at least a plausible outcome for CHAR_BIT > 8.
Talking about code units (= encoding forms) seems to make that easier
and avoids any additional questions when byte != octet.
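
As a rough illustration of why code units sidestep the byte-versus-octet question, here is a small C++ sketch. `code_unit_count` and `storage_bits_per_wide_unit` are hypothetical names introduced for this example only. The count of code units in a wide literal never mentions bytes, whereas any octet-based description has to go through CHAR_BIT.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: number of code units in a wide string literal,
// excluding the terminating null. No reference to bytes or octets.
template <std::size_t N>
constexpr std::size_t code_unit_count(const wchar_t (&)[N]) {
    return N - 1;
}

// Hypothetical helper: storage bits occupied by one wide code unit.
// This depends on CHAR_BIT, so the octet count per code unit is
// platform-specific (e.g. sizeof(wchar_t) could be 1 when CHAR_BIT is 16).
constexpr std::size_t storage_bits_per_wide_unit() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

On any platform, `code_unit_count(L"abc")` is 3; only the second function changes its answer when a byte is not an octet.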

UTF-16 code units are, however, specified to be 2 octets, and the same goes for all double-width encodings. And we established that users only care about representation.
If we switch to a form model (which we can't, because we cannot possibly define what that is for non-Unicode encodings, as no such definition exists), we lose the iconv-compatible property.

There is no new information here; we are switching back and forth between scheme and form and getting nowhere.

I think it's worth realizing that we are trying to specify something in a vacuum, and trying to be inventive.
Do we even know whether double-width encodings being used on DSPs is a non-theoretical scenario?

I am interested in knowing how much SG16 cares about this particular use case. Please consider polling that.

If we wanted to pursue CHAR_BIT == 16 support, we'd need to look at existing practice, which none of us has done, and about which I can't find any documentation.



Jens