sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 6 Oct 2021 16:02:45 +0200

On 06/10/2021 15.23, Corentin Jabot wrote:
>
>
> On Wed, Oct 6, 2021 at 2:49 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> On 06/10/2021 13.24, Peter Brett wrote:
> > Well, don't keep us in suspense, Jens.
> >
> > What *does* ISO 10646 define as the UTF-16 encoding scheme?
>
> BOM galore, default is big-endian:
>
>
> 11.5 UTF-16
>
> The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by ordering octets in a way that either the
> less significant octet precedes or follows the more significant octet.
> In the UTF-16 encoding scheme, the initial signature read as <FE FF> indicates that the more significant octet
> precedes the less significant octet, and <FF FE> the reverse. The signature is not part of the textual data.
> In the absence of signature, the octet order of the UTF-16 encoding scheme is that the more significant octet
> precedes the less significant octet.
>
>
> Context matters, Jens.
>
> The endianness of encodings is conveyed by the platform.
> BOMs are irrelevant in the scenarios of wide_literal and wide_environment.
> The distinction between encoding form and encoding scheme is also irrelevant.
> We can call them either way as long as we understand that the endianness is not part of text_encoding's invariant.
> In any case, the distinction is not useful to users in the scenario of wide_literal/wide_environment; there are better-suited
> APIs to deal with endianness. If you wanted to communicate a text_encoding object, then (and only then) would the distinction
> become useful - and you would have to use UTF-16LE or transmit/store endianness information along with the text_encoding object.
>
> A hypothetical system where the environment encoding does not use the same endianness as the rest of the environment might want to specify
> UTF-16LE/BE, for example, or return unknown for user encodings.
>
> We are not dealing with arbitrary data, nor streams, nor networks here.
> We are also not trying to describe any possible hypothetical scenario, merely to label known scenarios in a way that is useful for users.

That's all fine, but just means using "encoding scheme" (and thus the
IANA table, which presumably discusses encoding schemes) is not what we
want, at least in some cases.

> Unicode says this
>> The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and *in the absence of a higher-level protocol*, the byte order of the UTF-16 encoding scheme is big-endian.

I'm not finding that sentence in ISO 10646, our normative reference.
Do you have a pointer to a section therein, please?
In the absence of that, the normative words in ISO 10646 govern,
and they don't talk about a "higher-level protocol".

> There is a higher level protocol here: The C++ abstract machine.
> So, implementations should return UTF16 (the assumption being that this is less surprising to users), but can return UTF16LE/BE.
> To disagree with that, one would have to show that making the distinction is more useful to developers of portable applications.

I'm trying to understand how the IANA table, the specific values in that table,
the encodings those values represent, the use of "encoding form" vs. "encoding
scheme", and the use of integers (not octets) to initialize wchar_t's all fit
together. So far, there is friction that we need to resolve, in my view.
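
To make the form/scheme distinction concrete, here is a minimal sketch.
The helper names (ByteOrder, serialize_utf16, deserialize_utf16) are
hypothetical and appear in no paper or library; they only illustrate the
quoted 11.5 wording. The point is that code units are integers with no
octet order, and an order (plus the signature rule) only appears once
they are serialized to octets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only.
enum class ByteOrder { Big, Little };

// Encoding form -> encoding scheme: serialize UTF-16 code units
// (integers) into octets in the requested order. No signature is written.
std::vector<std::uint8_t> serialize_utf16(const std::vector<char16_t>& units,
                                          ByteOrder order) {
    std::vector<std::uint8_t> octets;
    octets.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) {
            octets.push_back(hi);
            octets.push_back(lo);
        } else {
            octets.push_back(lo);
            octets.push_back(hi);
        }
    }
    return octets;
}

// Encoding scheme -> encoding form, following the quoted 11.5 text:
// <FE FF> means more significant octet first, <FF FE> the reverse, the
// signature is not part of the text, and without a signature the more
// significant octet comes first. A trailing odd octet is ignored here.
std::vector<char16_t> deserialize_utf16(const std::vector<std::uint8_t>& octets) {
    ByteOrder order = ByteOrder::Big;  // default in the absence of a signature
    std::size_t i = 0;
    if (octets.size() >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            i = 2;
        } else if (octets[0] == 0xFF && octets[1] == 0xFE) {
            order = ByteOrder::Little;
            i = 2;
        }
    }
    std::vector<char16_t> units;
    for (; i + 1 < octets.size(); i += 2) {
        const unsigned first = octets[i];
        const unsigned second = octets[i + 1];
        const unsigned value = (order == ByteOrder::Big)
                                   ? (first << 8) | second
                                   : (second << 8) | first;
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

A wide literal corresponds to the std::vector<char16_t> side of this
sketch: its elements are initialized from integer values, so no octet
order or signature is involved until someone serializes them.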

> I don't care terribly (most UTF-16 environments are LE), as long as we make up our minds, but I'm not sure there is much value in spending so much time on this.
> But please, keep in mind the context, IANA has to deal with data exchange over the network, so does ISO 10646 and Unicode. We don't.

So, maybe the IANA table isn't it, then.

Jens

Received on 2021-10-06 09:02:57