Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 6 Oct 2021 15:23:24 +0200
On Wed, Oct 6, 2021 at 2:49 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 06/10/2021 13.24, Peter Brett wrote:
> > Well, don't keep us in suspense, Jens.
> >
> > What *does* ISO 10646 define as the UTF-16 encoding scheme?
>
> BOM galore, default is big-endian:
>
>
> 11.5 UTF-16
>
> The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by
> ordering octets in a way that either the
> less significant octet precedes or follows the more significant octet.
> In the UTF-16 encoding scheme, the initial signature read as <FE FF>
> indicates that the more significant octet
> precedes the less significant octet, and <FF FE> the reverse. The
> signature is not part of the textual data.
> In the absence of signature, the octet order of the UTF-16 encoding scheme
> is that the more significant octet
> precedes the less significant octet.
>

Context matters, Jens.

The endianness of these encodings is conveyed by the platform.
BOMs are irrelevant in the scenarios of wide_literal and wide_environment.
The distinction between encoding form and encoding scheme is also
irrelevant.
We can call them either way as long as we understand that the endianness is
not part of text_encoding's invariant.
In any case, the distinction is not useful to users in the scenario of
wide_literal/wide_environment; there are better-suited APIs for dealing
with endianness. If you wanted to communicate a text_encoding object, then
(and only then) would the distinction become useful, and you would have to
use UTF-16LE or transmit/store the endianness information along with the
text_encoding object.
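
To make the in-memory versus serialized distinction concrete, here is a
minimal sketch. It assumes a 16-bit char16_t and 8-bit bytes, and the names
byte_order, native_order, in_memory_bytes and serialize_utf16 are
hypothetical helpers of mine, not part of any proposal (C++20 code would
likely use std::endian instead of the hand-rolled enum). Code units in
memory are simply in platform order; a byte order only has to be chosen
when the text is turned into octets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

static_assert(sizeof(char16_t) == 2, "sketch assumes a 2-byte char16_t");

// Hypothetical label for a byte order; C++20 offers std::endian for this.
enum class byte_order { little, big };

// Probe the platform byte order by looking at the first byte of a known value.
inline byte_order native_order() {
    const char16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x01 ? byte_order::big : byte_order::little;
}

// What the program actually holds: code units in platform byte order.
inline std::vector<unsigned char> in_memory_bytes(const char16_t* units,
                                                  std::size_t count) {
    assert(units != nullptr || count == 0);
    std::vector<unsigned char> out(count * sizeof(char16_t));
    if (count != 0) {
        std::memcpy(out.data(), units, out.size());
    }
    return out;
}

// Only needed when the text leaves the process: pick an explicit byte order.
inline std::vector<unsigned char> serialize_utf16(const char16_t* units,
                                                  std::size_t count,
                                                  byte_order order) {
    assert(units != nullptr || count == 0);
    std::vector<unsigned char> out;
    out.reserve(count * 2);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char hi = static_cast<unsigned char>((units[i] >> 8) & 0xFF);
        const unsigned char lo = static_cast<unsigned char>(units[i] & 0xFF);
        if (order == byte_order::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

On such a platform, in_memory_bytes(s, n) should equal
serialize_utf16(s, n, native_order()), which is the sense in which the
platform already conveys the endianness.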

A hypothetical system where the environment encoding does not use the
same endianness as the rest of the environment could want to specify
UTF-16LE/BE, for example, or return unknown for user encodings.

We are not dealing with arbitrary data, nor streams, nor networks here.
We are also not trying to describe any possible hypothetical scenario,
merely to label known scenarios in a way that is useful for users.

Unicode says this:
> The UTF-16 encoding scheme may or may not begin with a BOM. However, when
> there is no BOM, and *in the absence of a higher-level protocol*, the byte
> order of the UTF-16 encoding scheme is big-endian.

There is a higher-level protocol here: the C++ abstract machine.
So implementations should return UTF-16 (the assumption being that this is
less surprising to users), but can return UTF-16LE/BE.
To disagree with that, one would have to show that making the distinction
is more useful to developers of portable applications.
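
For contrast, this is roughly what the quoted rule looks like when applied
to serialized octets, which is the data-exchange case and not ours. It is
only a sketch: utf16_byte_order, utf16_detection and detect_utf16_order are
hypothetical names, and letting a signature take precedence over the
protocol hint is an assumption of mine for the unlabeled "UTF-16" case.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical label for the byte order of a serialized UTF-16 stream.
enum class utf16_byte_order { big, little };

struct utf16_detection {
    utf16_byte_order order;
    std::size_t signature_length;  // 2 if a BOM was consumed, otherwise 0
};

// Decide the byte order of serialized UTF-16 octets:
//   1. a signature <FE FF> or <FF FE>, if present (assumed to win here),
//   2. otherwise the higher-level protocol, if one is supplied,
//   3. otherwise big-endian, as in the quoted text.
inline utf16_detection detect_utf16_order(
    const unsigned char* bytes, std::size_t size,
    std::optional<utf16_byte_order> protocol = std::nullopt) {
    assert(bytes != nullptr || size == 0);
    if (size >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            return {utf16_byte_order::big, 2};
        }
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            return {utf16_byte_order::little, 2};
        }
    }
    if (protocol) {
        return {*protocol, 0};
    }
    return {utf16_byte_order::big, 0};
}
```

In the wide_literal/wide_environment scenarios the platform plays the role
of the protocol argument, so the big-endian fallback in step 3 is never
reached.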

I don't care terribly (most UTF-16 environments are LE), as long as we make
up our minds, but I'm not sure there is much value in spending so much time
on this.
But please keep the context in mind: IANA has to deal with data exchange
over the network, and so do ISO 10646 and Unicode. We don't.



>
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-10-06 08:23:40