Re: [SG16] Structure of EBCDIC MBCS and wide EBCDIC

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 7 Oct 2021 11:30:40 -0400
On Thu, Oct 7, 2021 at 3:21 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 07/10/2021 03.24, Hubert Tong via SG16 wrote:
> > For information, since interest was expressed in today's meeting.
> >
> > Wide characters are mostly a C/C++ invention. For EBCDIC encodings that
> do not have multibyte characters, the wide encoding of a character consists
> of the unsigned char value of the character in a wchar_t.
> >
> > EBCDIC also has multibyte encodings. These are formed by pairing
> single-byte encodings and double-byte encodings. The unification of
> single-byte and double-byte encodings into a multibyte, stateful "narrow"
> encoding is achieved using shift-out/shift-in.
> >
> > The wide encoding of a character from a multibyte EBCDIC encoding is as
> described above for a character from the single-byte component encoding.
> For a character from the double-byte component encoding, the wide encoding
> of a character consists of the value obtained by using the first byte of
> the double-byte character as the upper 8 bits of a 16-bit value and the
> second byte as the lower 8 bits.
>
> So, we have a situation similar to UTF-16 here, I guess:
>
> The EBCDIC wide encoding uses 16-bit code units (integer values of
> type wchar_t).


Except wchar_t is 32 bits for 64-bit processes.
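
As an illustration of the mapping described above (the function names below are mine, not an existing interface), note that the resulting value fits in 16 bits even when wchar_t itself is 32 bits wide:

    // Single-byte EBCDIC character: the wide value is simply the
    // unsigned char value of the character.
    wchar_t wide_from_single_byte(unsigned char c) {
        return static_cast<wchar_t>(c);
    }

    // Double-byte EBCDIC character: the first byte becomes the upper
    // 8 bits, and the second byte the lower 8 bits, of a 16-bit value.
    wchar_t wide_from_double_byte(unsigned char first, unsigned char second) {
        return static_cast<wchar_t>((static_cast<unsigned>(first) << 8) | second);
    }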


> I'm reading your text that "upper bits" means
> "bits in the 16-bit integer", which would then be mapped to
> the object representation in some hardware endianness-dependent
> way.
>

That is a point I am not sure how to test. All implementations that I
am aware of only produce programs that operate in big-endian environments.
It is potentially an open design decision whether program or file
portability is more important. Note that a focus on file portability would
also mean that the single-byte EBCDIC encodings would have the narrow and
wide characters differ in value on little-endian systems. I believe it is
safe to say that there are more programs in general than programs that try
to interchange wide characters via byte streams/files.
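
To make the file-portability concern concrete, here is an illustrative sketch (not from the paper) of how the object representation of the same wide value differs by endianness:

    #include <cstdio>
    #include <cstring>

    int main() {
        // Hypothetical double-byte EBCDIC character 0x41 0x42 as a wide value.
        wchar_t w = 0x4142;
        unsigned char bytes[sizeof w];
        std::memcpy(bytes, &w, sizeof w);
        // With a 32-bit wchar_t, a big-endian system stores    00 00 41 42,
        // while a little-endian system stores                  42 41 00 00,
        // so writing wide characters to a byte stream is not portable
        // between the two without a convention.
        for (unsigned char b : bytes)
            std::printf("%02x ", b);
        std::printf("\n");
    }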


>
> Can you point to IANA entries that designate EBCDIC wide encodings?
> (My guess is that the IANA entries designate multibyte shift-state
> encodings, not wide encodings, so maybe we should invent a
> recommendation what the derived naming convention should be.)
>

Your guess is correct. The paper's EBCDIC example is, in effect, the
invented derived naming convention (although the last version of the paper
I saw did not indicate the width).


>
> We want UTF-16 in wchar_t to be represented by "UTF16" without LE/BE.
> By analogy, we should want wide-EBCDIC to be represented by some
> name that does not offer an opinion on hardware endianness.
>

Meaning that the invented names are meant to represent "native endian".
This appears consistent with the "program portability" direction.


>
> Another question: What happens if wchar_t is 32 bits on a CHAR_BIT==8
> platform and uses UTF-16, using two wchar_ts for some code points?
> If we go for the "object representation iconv-compatible" model,
> we can't say this is UTF-16, because the object representation
> has stray 0 bytes.
>

Yes.
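
An illustrative sketch of the problem (assuming a 32-bit wchar_t holding UTF-16 code units): a code point above U+FFFF takes two wchar_t elements, and each element's object representation carries stray zero bytes, so the bytes are not valid UTF-16.

    #include <cstdio>
    #include <cstring>

    int main() {
        // Surrogate pair for U+1F600 stored in (assumed) 32-bit wchar_t elements.
        wchar_t w[2] = { 0xD83D, 0xDE00 };
        unsigned char bytes[sizeof w];
        std::memcpy(bytes, w, sizeof w);
        // On a little-endian system this prints 3d d8 00 00 00 de 00 00,
        // which is not a valid UTF-16 byte sequence.
        for (unsigned char b : bytes)
            std::printf("%02x ", b);
        std::printf("\n");
    }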


>
> So, it seems our implementation recommendations need to somehow
> express that "UTF16" can only be returned if sizeof(wchar_t) == 2.
> But it seems very plausible to also return "UTF16" if CHAR_BIT == 16
> and sizeof(wchar_t) == 1.
>

We could try to go with non-normative explanations of intent regarding I/O
serialization.
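
For example, the intended condition could be sketched roughly as follows (illustrative only, not proposed wording):

    #include <climits>

    // "UTF16" is only an accurate description of the wide encoding's object
    // representation when a wchar_t object is exactly 16 bits wide; this
    // covers both sizeof(wchar_t) == 2 with CHAR_BIT == 8 and
    // sizeof(wchar_t) == 1 with CHAR_BIT == 16.
    constexpr bool wide_utf16_name_ok = sizeof(wchar_t) * CHAR_BIT == 16;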


> We're not yet at the bottom of this, sorry.
>
> Jens
>

Received on 2021-10-07 10:31:39