Date: Wed, 20 Oct 2021 07:06:35 +0200
On Wed, Oct 20, 2021, 00:20 Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 19/10/2021 23.09, Corentin wrote:
> > On Tue, Oct 19, 2021 at 10:38 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> >
> > Essentially agreed with your points;
> > I notice that talking about "encoding form"
> > (not encoding scheme and object representation)
> > answers a lot of the questions "naturally" and
> > seems to do what we want
> >
> >
> > This is contradicting previous polls and conclusions that IANA describe
> > encoding schemes, and users only care about encoding schemes.
>
> For ordinary strings, encoding forms and encoding schemes are
> indistinguishable.
> For wide strings, we're investing a lot of specification effort to level
> out the details of encoding scheme to essentially get back to encoding form.
> I'm arguing we should be doing this for all wide encodings, not just for
> UTF-16/32 in particular, or restrict the return value to UTF-16/32 to dodge
> the question.
>
> > Cf SG-16 minutes of previous meeting.
> > "Encoding form" only applies to UTF encoding anyway.
>
> I think the concept does apply to non-UTF wide encodings as well,
> but we have very little information about those.
>
> > I will not change this wording, given past polls.
>
> There might be new information that warrants asking the poll
> questions again.
>
> > And again, we have no experience with text handling on platforms that
> > have CHAR_BIT != 8, so we can either have it return unknown, or leave it
> > to implementers to decide whether their string encodings match that of a
> > registered encoding (by saying nothing, which is the status quo),
> > rather than trying to force a definition that will not match standard
> > practice (which would then force implementers to return unknown anyway).
>
> Ideally, we should strive to formulate the general guidance so that
> we get at least a plausible outcome for CHAR_BIT > 8.
> Talking about code units (= encoding forms) seems to make that easier
> and avoids any additional questions when byte != octet.
>
UTF-16 is, however, specified to be 2 octets, and the same holds for all
double-width encodings. And we established that users only care about
representation.
If we switch to a form model (which we can't, because we can't possibly define
what that is for non-Unicode encodings, as no such definition exists), we lose
the iconv-compatible property.
There is no new information here, we are switching back and forth between
scheme and form and getting nowhere.
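
To make the form/scheme distinction concrete for UTF-16, here is a minimal C++ sketch. The names utf16_form, utf16_scheme and ByteOrder are hypothetical and exist only for this illustration; they are not proposed wording or an existing API. The form step yields 16-bit code units with no byte order, while the scheme step serializes those units into octets, which is roughly the level at which iconv-style names such as UTF-16BE and UTF-16LE differ.

```cpp
#include <cassert>
#include <vector>

// Encoding form: map one Unicode scalar value to UTF-16 code units.
// The result is a sequence of 16-bit values; no byte order is involved.
std::vector<char16_t> utf16_form(char32_t cp) {
    // Only scalar values are valid input (no surrogates, max U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<char16_t>(cp)};
    }
    const unsigned long v = static_cast<unsigned long>(cp) - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}

enum class ByteOrder { big, little };

// Encoding scheme: serialize code units into octets in a chosen byte order.
// Each element of the result holds one octet value (0..255), regardless of
// how wide unsigned char happens to be on the platform.
std::vector<unsigned char> utf16_scheme(const std::vector<char16_t>& units,
                                        ByteOrder order) {
    std::vector<unsigned char> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<unsigned char>((u >> 8) & 0xFF);
        const auto lo = static_cast<unsigned char>(u & 0xFF);
        if (order == ByteOrder::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

With this sketch, U+20AC has the single code unit 0x20AC at the form level, but two different octet sequences at the scheme level: 20 AC (big endian) and AC 20 (little endian).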
I think it's worth realizing that we are trying to specify something in a
vacuum, and trying to be inventive.
Do we even know whether double-width encodings being used on DSPs is a
non-theoretical scenario?
I am interested in knowing how much SG16 cares about this particular use
case. Please consider polling that.
If we wanted to pursue CHAR_BIT == 16 support, we'd need to look at
existing practice, which none of us has done, and which I can't find any
documentation about.
> Jens
>
Received on 2021-10-20 00:06:49