sg16: Re: [SG16] Structure of EBCDIC MBCS and wide EBCDIC

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 13 Oct 2021 21:05:14 -0400

Thank you, Jens and Hubert, for this further discussion.

I think these are important points for the paper to address. However, I
don't think they materially affect the design intent, so I'm not
inclined to revisit the SG16 consensus. Please let me know if you feel
this is new information that warrants another trip through SG16.

Corentin, I suggest doing the following:

  * Add references to the relevant IBM documentation linked by Hubert.
    The paper currently includes a link to one of them, but it doesn't
    appear in the References section (IBM URLs don't tend to be
    particularly stable over time, so having the title and date included
    in the paper would be helpful).
  * Add some additional prose that demonstrates or describes the
    differences between the IBM SBCS, DBCS, and (stateful) MBCS
    encodings with the intent to illustrate why they represent distinct
    encodings for the purposes of P1885. For example, demonstrate how
    the underlying representation changes for each of the following (or
    a subset of them; it might be overkill to address all of them).
      o CCSID 00833, Korean, EBCDIC, SBCS
      o CCSID 00834, Korean, EBCDIC, DBCS
      o CCSID 00933, Korean, EBCDIC, MBCS
      o CCSID 00934, Korean, ASCII, MBCS
      o CCSID 01364, Korean, EBCDIC, MBCS
      o CCSID 04930, Korean, EBCDIC, DBCS
      o CCSID 21314, Korean, EBCDIC, DBCS
  * Add guidelines for registering wide encodings with IANA; e.g.,
    recommended naming conventions and native endian encodings
    (potentially in addition to BE/LE encodings that might be used for
    octet-based interchange).
  * Add normative encouragement that, e.g., UTF-16 should not be
    returned for wide_literal() and wide_environment() when
    sizeof(wchar_t) is other than 1 or 2.
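
As a rough illustration of the last point, a check along these lines
could gate the name. This is only a sketch: fits_utf16_code_unit is a
hypothetical helper, not something P1885 specifies, and reading "1 or 2"
as "the wchar_t object is exactly 16 bits wide" is an assumption.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper (not part of P1885): true when an object made of
// size_in_chars chars of bits_per_char bits each is exactly 16 bits wide,
// i.e. its object representation is one UTF-16 code unit with no extra bytes.
constexpr bool fits_utf16_code_unit(std::size_t size_in_chars,
                                    unsigned bits_per_char) {
    return size_in_chars * bits_per_char == 16;
}

// Evaluated for the current platform. Assumption: an implementation would
// report "UTF-16" as its wide encoding only when this is true.
constexpr bool wchar_is_16_bits =
    fits_utf16_code_unit(sizeof(wchar_t), CHAR_BIT);
```

With this reading, sizeof(wchar_t) == 2 with CHAR_BIT == 8 and
sizeof(wchar_t) == 1 with CHAR_BIT == 16 both qualify, while a 32-bit
wchar_t does not.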

Apologies if some of this is already present in the latest revision. I
haven't re-read the entire paper.

Tom.

On 10/7/21 11:30 AM, Hubert Tong via SG16 wrote:
> On Thu, Oct 7, 2021 at 3:21 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 07/10/2021 03.24, Hubert Tong via SG16 wrote:
> > For information, since interest was expressed in today's meeting.
> >
> > Wide characters are mostly a C/C++ invention. For EBCDIC encodings
> > that do not have multibyte characters, the wide encoding of a
> > character consists of the unsigned char value of the character in a
> > wchar_t.
> >
> > EBCDIC also has multibyte encodings. These are formed by pairing
> > single-byte encodings and double-byte encodings. The unification of
> > single-byte and double-byte encodings into a multibyte, stateful
> > "narrow" encoding is achieved using shift-out/shift-in.
> >
> > The wide encoding of a character from a multibyte EBCDIC encoding is
> > as described above for a character from the single-byte component
> > encoding. For a character from the double-byte component encoding,
> > the wide encoding of a character consists of the value obtained by
> > using the first byte of the double-byte character as the upper 8
> > bits of a 16-bit value and the second byte as the lower 8 bits.
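
The mapping described above can be sketched as follows. This is a
minimal illustration, not IBM's implementation: to_wide_values is a
hypothetical helper, the shift-out and shift-in control bytes are
assumed to be 0x0E and 0x0F, and std::uint32_t stands in for wchar_t so
that the sketch does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed control bytes for the stateful "narrow" encoding.
constexpr unsigned char kShiftOut = 0x0E;  // enter double-byte mode
constexpr unsigned char kShiftIn  = 0x0F;  // return to single-byte mode

// Hypothetical helper: converts a stateful multibyte sequence to wide values.
// Single-byte characters keep their unsigned char value; double-byte
// characters become (first byte << 8) | second byte.
// Returns false if the input ends in the middle of a double-byte character.
bool to_wide_values(const std::vector<unsigned char>& bytes,
                    std::vector<std::uint32_t>& out) {
    out.clear();
    bool double_byte = false;  // the initial shift state is single-byte
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = bytes[i];
        if (b == kShiftOut) { double_byte = true; continue; }
        if (b == kShiftIn) { double_byte = false; continue; }
        if (!double_byte) {
            out.push_back(b);
            continue;
        }
        if (i + 1 >= bytes.size()) return false;  // truncated pair
        out.push_back((static_cast<std::uint32_t>(b) << 8) | bytes[i + 1]);
        ++i;  // the second byte of the pair has been consumed
    }
    return true;
}
```

The result is a sequence of integer values; how each value is laid out
in memory is left to the platform, which is where the endianness
question comes in.
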
>
> So, we have a situation similar to UTF-16 here, I guess:
>
> The EBCDIC wide encoding uses 16-bit code units (integer values of
> type wchar_t).
>
>
> Except wchar_t is 32 bits for 64-bit processes.
>
> I'm reading your text that "upper bits" means
> "bits in the 16-bit integer", which would then be mapped to
> the object representation in some hardware endianness-dependent
> way.
>
>
> That's a point that I am not sure how to test. All implementations
> that I am aware of only produce programs that operate in big endian
> environments. It is potentially an open design decision whether
> program or file portability is more important. Note that a focus on
> file portability would also mean that the single-byte EBCDIC encodings
> would have the narrow and wide characters differ in value on
> little-endian systems. I believe it is safe to say that there are more
> programs in general than programs that try to interchange wide
> characters via byte streams/files.
>
>
> Can you point to IANA entries that designate EBCDIC wide encodings?
> (My guess is that the IANA entries designate multibyte shift-state
> encodings, not wide encodings, so maybe we should invent a
> recommendation what the derived naming convention should be.)
>
>
> Your guess is correct. The paper's EBCDIC example is somewhat the
> invented derived naming convention (although the last version I saw in
> the paper did not indicate the width).
>
>
> We want UTF-16 in wchar_t to be represented by "UTF16" without LE/BE.
> By analogy, we should want wide-EBCDIC to be represented by some
> name that does not offer an opinion on hardware endianness.
>
>
> Meaning that the invented names are meant to represent "native
> endian". This appears consistent with the "program portability" direction.
>
>
> Another question: What happens if wchar_t is 32 bits on a CHAR_BIT == 8
> platform and uses UTF-16, using two wchar_ts for some code points?
> If we go for the "object representation iconv-compatible" model,
> we can't say this is UTF-16, because the object representation
> has stray 0 bytes.
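
The stray bytes can be made visible with a small sketch. Both helpers
are hypothetical, std::uint16_t and std::uint32_t stand in for a 16-bit
and a 32-bit wchar_t, and CHAR_BIT == 8 is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies the object representation of a value into
// a byte vector, in whatever order the hardware stores it.
template <typename T>
std::vector<unsigned char> object_bytes(T value) {
    std::vector<unsigned char> bytes(sizeof(T));
    std::memcpy(bytes.data(), &value, sizeof(T));
    return bytes;
}

// Hypothetical helper: counts the zero bytes in an object representation.
inline std::size_t zero_byte_count(const std::vector<unsigned char>& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if (b == 0) ++count;
    }
    return count;
}
```

For the surrogate code unit 0xD83D, the 16-bit object holds two non-zero
bytes, while the 32-bit object holds the same two bytes plus two zero
bytes that a byte-oriented UTF-16 converter would not expect. Where the
zero bytes fall depends on endianness.
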
>
>
> Yes.
>
>
> So, it seems our implementation recommendations need to somehow
> express that "UTF16" can only be returned if sizeof(wchar_t) == 2.
> But it seems very plausible to also return "UTF16" if CHAR_BIT == 16
> and sizeof(wchar_t) == 1.
>
>
> We could try to go with non-normative explanations of intent regarding
> I/O serialization.
>
>
> We're not yet at the bottom of this, sorry.
>
> Jens
>
>

Received on 2021-10-13 20:05:20