Date: Wed, 6 Oct 2021 17:05:45 +0200
On Wed, Oct 6, 2021 at 4:53 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 06/10/2021 16.42, Corentin Jabot wrote:
> >
> >
> > On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
>
> > I'm trying to understand how the IANA table, the specific values in
> > that table, the encodings those values represent, the use of "encoding
> > form" vs. "encoding scheme", and the use of integers (not octets) to
> > initialize wchar_t's all fit together. So far, there is friction that
> > we need to resolve, in my view.
> >
> >
> > There is wording that Hubert asks for that says that how these things
> > relate is implementation defined.
>
> And I think that's not helpful for portable code.
>
> > A non-hostile implementation would return a registered encoding that has
> > a code unit size of CHAR_BIT for narrow functions, and a registered
> > encoding that has a code unit size of sizeof(wchar_t) for wide functions
> > (if it exists). The byte order of wide string literals is platform specific
> > and P1885 has no bearing on that. P1885 also does not affect how wchar_t
> > represents values.
> > IANA does not specify a byte order in the general case (merely that
> > there is one), so we are not running afoul of anything.
> > And "encoding form" vs. "encoding scheme" is Unicode specific.
>
> The question of "encoding form" vs. "encoding scheme" arises for any
> wchar_t encoding in the context of the IANA table, but there appear
> to be very few encodings specified as integers as opposed to
> sequences of bytes.
>
More like 0
>
> I'm curious how wchar_t is treated in a non-Unicode situation.
> Even something like Big5 https://en.wikipedia.org/wiki/Big5
> appears to be byte-based, not integer-based:
>
> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for
> non-user-defined characters)
> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
> So, it seems to be a multibyte encoding, not a wide one.
>
Sure, because it predates Unicode terminology. But the concept is the same.
A code unit is still 2 bytes, and it cannot be split any further. There
is no character in Big5 that is encoded as a single byte.
A UTF-16 code unit is also 2 bytes.
wchar_t is suitable to represent any encoding that represents a character in
N bytes (or as a sequence of N-byte units), for N = sizeof(wchar_t).
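
To make the size argument concrete, here is a rough sketch, not anything
taken from P1885. code_unit_bits and pack_big5 are hypothetical names
introduced only for illustration, and the packing assumes 8-bit bytes with
the lead byte placed in the high-order position, which is one possible
convention rather than something Big5 or IANA mandates.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Width of one code unit of type CharT, in bits.
template <typename CharT>
constexpr std::size_t code_unit_bits() {
    return sizeof(CharT) * CHAR_BIT;
}

// Hypothetical helper: combine a Big5 lead byte and trail byte into a single
// 16-bit value, lead byte in the high-order position (an assumed convention).
inline std::uint16_t pack_big5(unsigned char lead, unsigned char trail) {
    assert(lead >= 0x81 && lead <= 0xFE);  // lead-byte range quoted above
    return static_cast<std::uint16_t>((lead << 8) | trail);
}
```

With this, the byte pair 0xA4 0x40 becomes the single value 0xA440, which
fits in a wchar_t whenever code_unit_bits<wchar_t>() is at least 16.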
>
> > The C++ specification and implementations produce and have expectations
> > about strings.
> > If the strings produced or the expectations match the description of a
> > given existing known encoding, then this encoding is suitable to label the
> > strings and expectations of the C++ program, otherwise it isn't.
> > I'm really struggling to see where the contention is here.
>
> The contention is that [lex.string] initializes wchar_t's with
> (potentially large) integer values (which I understand to be
> "encoding forms" in Unicode parlance), but the RFC accompanying
> the IANA table says the encodings described there are octet-based
> encodings, which I understand to be "encoding schemes" in
> Unicode parlance.
>
Does the wording suggested by Hubert (specifying that we are talking about
the object representation) address your concern?
We are talking about initialized strings, not what they have been
initialized with.
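
As a small illustration of the distinction, the sketch below separates the
value of a wchar_t from its object representation. object_bytes,
from_bytes and check_round_trip are hypothetical helpers, not proposed
wording or API. The order of the bytes in the array is whatever the
platform uses, which is why the sketch only relies on the round trip.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

using WideBytes = std::array<unsigned char, sizeof(wchar_t)>;

// Hypothetical helper: copy out the object representation of one wchar_t.
inline WideBytes object_bytes(wchar_t c) {
    WideBytes out{};
    std::memcpy(out.data(), &c, sizeof c);
    return out;
}

// Hypothetical helper: rebuild the wchar_t value from those same bytes.
inline wchar_t from_bytes(const WideBytes& bytes) {
    wchar_t c;
    std::memcpy(&c, bytes.data(), sizeof c);
    return c;
}

// The integer value survives the trip through bytes, whatever their order.
inline void check_round_trip(wchar_t c) {
    assert(from_bytes(object_bytes(c)) == c);
}
```

The integer used to initialize the element is one thing. The sequence of
sizeof(wchar_t) bytes that ends up in memory is another, and that sequence
is what an octet-based description of an encoding could be compared
against.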
>
> Jens
>
Received on 2021-10-06 10:06:00