Thank you, Jens and Hubert, for this further discussion.

I think these are important points for the paper to address. However, I don't think they materially affect the design intent, so I'm not inclined to revisit the SG16 consensus. Please let me know if you feel this is new information that warrants another trip through SG16.

Corentin, I suggest doing the following:

Apologies if some of this is already present in the latest revision. I haven't re-read the entire paper.

Tom.

On 10/7/21 11:30 AM, Hubert Tong via SG16 wrote:
On Thu, Oct 7, 2021 at 3:21 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 07/10/2021 03.24, Hubert Tong via SG16 wrote:
> For information, since interest was expressed in today's meeting.
>
> Wide characters are mostly a C/C++ invention. For EBCDIC encodings that do not have multibyte characters, the wide encoding of a character consists of the unsigned char value of the character in a wchar_t.
>
> EBCDIC also has multibyte encodings. These are formed by pairing single-byte encodings and double-byte encodings. The unification of single-byte and double-byte encodings into a multibyte, stateful "narrow" encoding is achieved using shift-out/shift-in.
>
> The wide encoding of a character from a multibyte EBCDIC encoding is as described above for a character from the single-byte component encoding. For a character from the double-byte component encoding, the wide encoding of a character consists of the value obtained by using the first byte of the double-byte character as the upper 8 bits of a 16-bit value and the second byte as the lower 8 bits.
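>
> For illustration only (the function names here are invented, not taken from any implementation), that mapping amounts to:
>
> // Single-byte component: the wide value is simply the unsigned char value.
> wchar_t wide_from_single(unsigned char c) {
>     return static_cast<wchar_t>(c);
> }
>
> // Double-byte component: the first byte supplies the upper 8 bits and
> // the second byte the lower 8 bits of the wide value.
> wchar_t wide_from_double(unsigned char b1, unsigned char b2) {
>     return static_cast<wchar_t>((b1 << 8) | b2);
> }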

So, we have a situation similar to UTF-16 here, I guess:

The EBCDIC wide encoding uses 16-bit code units (integer values of
type wchar_t).

Except wchar_t is 32 bits for 64-bit processes.
 
I'm reading your text as saying that "upper 8 bits" means
"bits in the 16-bit integer value", which would then be mapped to
the object representation in some hardware endianness-dependent
way.

That's a point that I am not sure how to test. All implementations that I am aware of only produce programs that operate in big endian environments. It is potentially an open design decision whether program or file portability is more important. Note that a focus on file portability would also mean that the single-byte EBCDIC encodings would have the narrow and wide characters differ in value on little-endian systems. I believe it is safe to say that there are more programs in general than programs that try to interchange wide characters via byte streams/files.
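
For concreteness, a minimal sketch (assuming CHAR_BIT == 8 and a 32-bit wchar_t; purely illustrative) showing that the same wide value has an endianness-dependent object representation:

#include <cstdio>
#include <cstring>

int main() {
    wchar_t w = 0xC1;                   // EBCDIC 'A' as a wide value
    unsigned char bytes[sizeof w];
    std::memcpy(bytes, &w, sizeof w);   // the object representation
    for (unsigned char b : bytes)
        std::printf("%02X ", b);        // "C1 00 00 00" on little endian,
    std::printf("\n");                  // "00 00 00 C1" on big endian
}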
 

Can you point to IANA entries that designate EBCDIC wide encodings?
(My guess is that the IANA entries designate multibyte shift-state
encodings, not wide encodings, so maybe we should invent a
recommendation for what the derived naming convention should be.)

Your guess is correct. The paper's EBCDIC example is essentially such an invented derived naming convention (although the last version of the paper that I saw did not indicate the width).
 

We want UTF-16 in wchar_t to be represented by "UTF16" without LE/BE.
By analogy, we should want wide-EBCDIC to be represented by some
name that does not offer an opinion on hardware endianness.

Meaning that the invented names are meant to represent "native endian". This appears consistent with the "program portability" direction.
 

Another question: What happens if wchar_t is 32 bits on a CHAR_BIT == 8
platform and uses UTF-16, using two wchar_t objects for some code points?
If we go for the "object representation iconv-compatible" model,
we can't say this is UTF-16, because the object representation
has stray 0 bytes.

Yes.
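
For example (assuming CHAR_BIT == 8 and a 32-bit little-endian wchar_t; the array name is invented):

// U+1F600 needs two UTF-16 code units; here each one occupies a whole wchar_t.
wchar_t s[] = { 0xD83D, 0xDE00 };   // surrogate pair for U+1F600
// Object representation of s: 3D D8 00 00 00 DE 00 00
// The 00 padding bytes mean those bytes are not a valid UTF-16 stream.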
 

So, it seems our implementation recommendations need to somehow
express that "UTF16" can only be returned if sizeof(wchar_t) == 2.
But it seems very plausible to also return "UTF16" if CHAR_BIT == 16
and sizeof(wchar_t) == 1.
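
Roughly, an illustrative (not proposed) way to state the condition:

#include <climits>

// Covers both cases above: sizeof(wchar_t) == 2, or a 16-bit byte
// where sizeof(wchar_t) == 1.
constexpr bool may_report_utf16 =
    sizeof(wchar_t) == 2 || (CHAR_BIT == 16 && sizeof(wchar_t) == 1);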

We could try to go with non-normative explanations of intent regarding I/O serialization.


We're not yet at the bottom of this, sorry.

Jens