Re: [SG16] Structure of EBCDIC MBCS and wide EBCDIC

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 7 Oct 2021 09:21:40 +0200
On 07/10/2021 03.24, Hubert Tong via SG16 wrote:
> For information, since interest was expressed in today's meeting.
>
> Wide characters are mostly a C/C++ invention. For EBCDIC encodings that do not have multibyte characters, the wide encoding of a character consists of the unsigned char value of the character in a wchar_t.
>
> EBCDIC also has multibyte encodings. These are formed by pairing single-byte encodings and double-byte encodings. The unification of single-byte and double-byte encodings into a multibyte, stateful "narrow" encoding is achieved using shift-out/shift-in.
>
> The wide encoding of a character from a multibyte EBCDIC encoding is as described above for a character from the single-byte component encoding. For a character from the double-byte component encoding, the wide encoding of a character consists of the value obtained by using the first byte of the double-byte character as the upper 8 bits of a 16-bit value and the second byte as the lower 8 bits.

So, we have a situation similar to UTF-16 here, I guess:

The EBCDIC wide encoding uses 16-bit code units (integer values of
type wchar_t). I'm reading "upper 8 bits" in your text as referring
to bits of the 16-bit integer value, which would then be mapped to
the object representation in some hardware endianness-dependent
way.
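
To make that reading concrete, here is a minimal sketch, assuming
CHAR_BIT == 8 and that std::uint16_t exists. The helper names are
hypothetical, and the byte pair 0x42 0x41 is an arbitrary placeholder,
not a statement about any real double-byte code page. The integer
value is the same everywhere; only the object representation depends
on the host's byte order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the first byte of the double-byte character becomes
// the upper 8 bits of the 16-bit value, the second byte the lower 8 bits.
constexpr std::uint16_t wide_from_dbcs(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>((static_cast<unsigned>(first) << 8) | second);
}

// Copy out the object representation of the 16-bit value as the host stores it.
inline std::array<unsigned char, 2> object_bytes(std::uint16_t value) {
    std::array<unsigned char, 2> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

inline void example_dbcs() {
    // Placeholder byte pair, chosen only for illustration.
    const std::uint16_t w = wide_from_dbcs(0x42, 0x41);
    assert(w == 0x4241);  // the integer value does not depend on endianness

    const auto b = object_bytes(w);
    // {0x42, 0x41} on a big-endian host, {0x41, 0x42} on a little-endian one.
    assert((b[0] == 0x42 && b[1] == 0x41) || (b[0] == 0x41 && b[1] == 0x42));
}
```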

Can you point to IANA entries that designate EBCDIC wide encodings?
(My guess is that the IANA entries designate multibyte shift-state
encodings, not wide encodings, so maybe we should come up with a
recommendation for what the derived naming convention should be.)

We want UTF-16 in wchar_t to be represented by "UTF16" without LE/BE.
By analogy, we should want wide-EBCDIC to be represented by some
name that does not offer an opinion on hardware endianness.

Another question: What happens if wchar_t is 32 bits on a CHAR_BIT == 8
platform and uses UTF-16, using two wchar_ts for some code points?
If we go for the "object representation iconv-compatible" model,
we can't say this is UTF-16, because the object representation
has stray 0 bytes.
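
A small sketch of the stray-bytes problem, assuming CHAR_BIT == 8 and
using std::uint32_t as a stand-in for a 32-bit wchar_t (the function
names are hypothetical). U+1F600 needs the surrogate pair D83D DE00,
which is 4 bytes as a UTF-16 byte stream, but 8 bytes when each code
unit sits in a 32-bit element:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Split a supplementary code point (U+10000..U+10FFFF) into its two UTF-16
// surrogates, storing each 16-bit code unit in its own 32-bit element.
inline std::vector<std::uint32_t> utf16_in_32bit_units(char32_t cp) {
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;
    return { 0xD800u + (v >> 10), 0xDC00u + (v & 0x3FFu) };
}

// Copy out the object representation of the whole array.
inline std::vector<unsigned char> object_representation(const std::vector<std::uint32_t>& units) {
    std::vector<unsigned char> bytes(units.size() * sizeof(std::uint32_t));
    if (!bytes.empty()) std::memcpy(bytes.data(), units.data(), bytes.size());
    return bytes;
}

// Count the bytes with value zero in a byte sequence.
inline std::size_t count_zero_bytes(const std::vector<unsigned char>& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) if (b == 0) ++n;
    return n;
}

inline void example_stray_bytes() {
    const auto units = utf16_in_32bit_units(U'\U0001F600');
    assert(units[0] == 0xD83D && units[1] == 0xDE00);

    const auto bytes = object_representation(units);
    // Twice the 4 bytes a UTF-16LE or UTF-16BE stream would use.
    assert(bytes.size() == 8);
    // The unused upper half of each element contributes two zero bytes.
    assert(count_zero_bytes(bytes) >= 4);
}
```

Feeding those 8 bytes to a byte-oriented converter as "UTF-16" would
presumably yield extra U+0000 code units or an error, whichever
endianness is picked.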

So, it seems our implementation recommendations need to somehow
express that "UTF16" can only be returned if sizeof(wchar_t) == 2.
But it seems very plausible to also return "UTF16" if CHAR_BIT == 16
and sizeof(wchar_t) == 1.
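
One possible way to phrase a condition covering both cases is in terms
of total bits rather than sizeof alone. This is only a sketch of that
idea, not a settled recommendation; the function names are
hypothetical, and it assumes wchar_t has no padding bits.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical predicate: true when an element of the given size has
// exactly 16 bits, so one element holds exactly one UTF-16 code unit.
constexpr bool width_allows_utf16_name(std::size_t size_in_bytes, std::size_t bits_per_byte) {
    return size_in_bytes * bits_per_byte == 16;
}

// The same check applied to the host's wchar_t.
constexpr bool host_wchar_allows_utf16_name() {
    return width_allows_utf16_name(sizeof(wchar_t), CHAR_BIT);
}

static_assert(width_allows_utf16_name(2, 8), "sizeof == 2 with 8-bit bytes");
static_assert(width_allows_utf16_name(1, 16), "sizeof == 1 with 16-bit bytes");
static_assert(!width_allows_utf16_name(4, 8), "32-bit wchar_t has stray bits");
```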

We're not yet at the bottom of this, sorry.

Jens

Received on 2021-10-07 02:21:45