Date: Wed, 6 Oct 2021 16:42:24 +0200
On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 06/10/2021 15.23, Corentin Jabot wrote:
> >
> >
> > On Wed, Oct 6, 2021 at 2:49 PM Jens Maurer via SG16 <
> sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> >
> > On 06/10/2021 13.24, Peter Brett wrote:
> > > Well, don't keep us in suspense, Jens.
> > >
> > > What *does* ISO 10646 define as the UTF-16 encoding scheme?
> >
> > BOM galore, default is big-endian:
> >
> >
> > 11.5 UTF-16
> >
> > The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by
> ordering octets in a way that either the
> > less significant octet precedes or follows the more significant
> octet.
> > In the UTF-16 encoding scheme, the initial signature read as <FE FF>
> indicates that the more significant octet
> > precedes the less significant octet, and <FF FE> the reverse. The
> signature is not part of the textual data.
> > In the absence of signature, the octet order of the UTF-16 encoding
> scheme is that the more significant octet
> > precedes the less significant octet.
> >
> >
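[The ISO 10646 11.5 rule quoted above can be sketched as follows. This is an illustrative helper, not anything from the paper; the name and enum are made up for the example.]

```cpp
#include <cstddef>

enum class Utf16Order { Big, Little };

// Decide the byte order of a UTF-16 *encoding scheme* octet stream
// per ISO 10646 11.5: a leading <FE FF> signature means big-endian,
// <FF FE> means little-endian, and in the absence of a signature the
// default is big-endian (more significant octet first).
Utf16Order detect_utf16_order(const unsigned char* octets, std::size_t n) {
    if (n >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) return Utf16Order::Big;
        if (octets[0] == 0xFF && octets[1] == 0xFE) return Utf16Order::Little;
    }
    return Utf16Order::Big;  // no signature: default octet order
}
```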
> > Context matters, Jens.
> >
> > The endianness of encodings is conveyed by the platform.
> > BOMs are irrelevant in the scenarios of wide_literal and
> > wide_environment.
> > The distinction between encoding form and encoding scheme is also
> irrelevant.
> > We can call them either way as long as we understand that the endianness
> is not part of text_encoding's invariant.
> > In any case, the distinction is not useful to users in the scenario of
> wide_literal/wide_environment, there are more suited
> > APIs to deal with endianness. If you wanted to communicate a
> > text_encoding object, then (and only then) would the distinction
> > become useful - and you would have to use UTF-16LE or transmit/store
> > endianness information alongside the text_encoding.
> >
> > A hypothetical system where the environment encoding did not use the
> > same endianness as the rest of the environment might want to specify
> > UTF-16LE/BE, for example, or return unknown for user encodings.
> >
> > We are not dealing with arbitrary data, nor streams, nor networks here.
> > We are also not trying to describe any possible hypothetical scenario,
> merely to label known scenarios in a way that is useful for users.
>
> That's all fine, but just means using "encoding scheme" (and thus the
> IANA table, which presumably discusses encoding schemes) is not what we
> want, at least in some cases.
>
> > Unicode says this
> >> The UTF-16 encoding scheme may or may not begin with a BOM. However,
> when there is no BOM, and *in the absence of a higher-level protocol *the
> byte order of the UTF-16 encoding scheme is big-endian.
>
> I'm not finding that sentence in ISO 10646, our normative reference.
> Do you have a pointer to a section therein, please?
> In the absence of that, the normative words in ISO 10646 govern,
> and they don't talk about a "higher-level protocol".
>
> > There is a higher level protocol here: The C++ abstract machine.
> > So, implementations should return UTF-16 (the assumption being that it is
> > less surprising to users), but can return UTF-16LE/BE.
> > To disagree with that, one would have to show that making the
> > distinction is more useful to developers of portable applications.
>
> I'm trying to understand how the IANA table, the specific values in that
> table,
> the encodings those values represent, the use of "encoding form" vs.
> "encoding
> scheme", and the use of integers (not octets) to initialize wchar_t's all
> fit
> together. So far, there is friction that we need to resolve, in my view.
>
There is wording that Hubert asked for saying that how these things
relate is implementation-defined.
A non-hostile implementation would return a registered encoding that has a
code unit size of CHAR_BIT bits for the narrow functions, and a registered
encoding whose code unit size matches sizeof(wchar_t) for the wide functions
(if one exists). The byte order of wide string literals is platform-specific,
and P1885 has no bearing on that. P1885 also does not affect how wchar_t
represents values.
IANA does not specify a byte order in the general case (merely that there
is one), so we are not running afoul of anything.
And "encoding form" vs. "encoding scheme" is Unicode specific.
The C++ specification and implementations both produce strings and have
expectations about strings.
If the strings produced, or the expectations, match the description of a
given known encoding, then that encoding is suitable to label the strings
and expectations of the C++ program; otherwise it isn't.
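[A small sketch of the platform-provided facts such an implementation would consult. It deliberately does not use std::text_encoding itself (which is what P1885 proposes); the helper names are made up for illustration.]

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>
#include <cstring>

// The byte order observed through the object representation: this is
// what actually determines the octet order of wide string literals in
// memory, independently of any encoding label P1885 might report.
bool platform_is_little_endian() {
    const std::uint16_t probe = 0x00FF;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0xFF;
}

// The code unit width of wchar_t in bits: the quantity a "non-hostile"
// implementation would match against the code unit size of whichever
// registered encoding it reports for the wide functions.
constexpr int wide_code_unit_bits = sizeof(wchar_t) * CHAR_BIT;
```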
I'm really struggling to see where the contention is here.
>
> > I don't care terribly (most UTF-16 environments are LE), as long as we
> make up our mind, but I'm not sure there is much value in spending so much
> time on this.
> > But please, keep in mind the context, IANA has to deal with data
> exchange over the network, so does ISO 10646 and Unicode. We don't.
>
> So, maybe the IANA table isn't it, then.
>
That they have more scenarios to cater to does not mean it's not suitable.
>
> Jens
>
Received on 2021-10-06 09:42:38