On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:


On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 15/06/2020 00.06, Hubert Tong wrote:
> The presence of a UCN for a C1 (non-EBCDIC) control character in a supposedly-EBCDIC string is not immediately indicative of an error.
In this example, is the UCN intended to mean the conventionally mapped
EBCDIC control character, or something else?

Beyond EBCDIC control characters, do we know of any other situation
where input-to-Unicode mapping is lossy or does not preserve semantics?
It would be good to keep a list in one of the upcoming papers, for
the permanent record.

There are 3 scenarios I can think of:
  • The control characters for EBCDIC, and for other encodings that have more control characters than ASCII; all of those map to C0/C1 in an application-specific manner
  • Some (~20) GB 18030 characters map to the Unicode private use area, which also doesn't preserve semantics
  • Some Big5 characters do not have a Unicode mapping at all (these are exclusively place and personal names and, for example, do not affect the Windows Big5 code pages)
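
To make the first scenario concrete, here is a minimal sketch that lists which bytes of an EBCDIC code page land in the C1 range (U+0080 to U+009F) after conversion. It assumes Python's built-in cp037 codec as a stand-in for "an EBCDIC encoding"; the helper name ebcdic_bytes_mapped_to_c1 is hypothetical, and other converters may assign these controls differently.

```python
import unicodedata


def ebcdic_bytes_mapped_to_c1(codec="cp037"):
    """Return {ebcdic_byte: unicode_char} for bytes that decode into C1.

    Assumption: `codec` is a single-byte EBCDIC codec shipped with Python
    in which every byte value decodes (true for cp037).
    """
    hits = {}
    for byte in range(0x100):
        ch = bytes([byte]).decode(codec)
        if 0x80 <= ord(ch) <= 0x9F:  # C1 control range
            hits[byte] = ch
    return hits


if __name__ == "__main__":
    for byte, ch in sorted(ebcdic_bytes_mapped_to_c1().items()):
        # The Unicode side only says "control"; the EBCDIC meaning of the
        # byte is not recorded anywhere in the converted text.
        print(f"EBCDIC 0x{byte:02X} -> U+{ord(ch):04X} "
              f"(category {unicodedata.category(ch)})")
```

The bytes survive a round trip, but the C1 code point they pass through carries no EBCDIC-specific meaning, which is why a C1 UCN in a supposedly-EBCDIC string is ambiguous.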

There are also the characters that have duplicate code point assignments in Shift-JIS such that one of them won't round trip through Unicode.  It sounds like GB 18030 has one such case as well.
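
A rough way to enumerate such duplicates is to decode every candidate double-byte sequence and check whether re-encoding gives back the same bytes. The sketch below assumes Python's cp932 codec (the Windows Shift-JIS variant) as the conversion table; the function name and the lead/trail byte ranges are my own assumptions for illustration, and other Shift-JIS converters may well disagree on the details.

```python
def find_non_roundtripping(codec="cp932"):
    """Return (original_bytes, char, reencoded_bytes) triples that decode
    to one character but re-encode to a different byte sequence."""
    results = []
    # Assumed double-byte lead and trail ranges for Shift-JIS variants.
    leads = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    trails = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))
    for lead in leads:
        for trail in trails:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned in this codec
            if len(text) != 1:
                continue  # not decoded as a single double-byte character
            try:
                back = text.encode(codec)
            except UnicodeEncodeError:
                continue
            if back != raw:
                results.append((raw, text, back))
    return results


if __name__ == "__main__":
    dups = find_non_roundtripping()
    print(f"{len(dups)} sequences do not round-trip")
    for raw, ch, back in dups[:10]:
        print(f"{raw.hex()} -> U+{ord(ch):04X} -> {back.hex()}")
```

Each reported triple is a pair of byte sequences sharing one Unicode code point, so only the sequence the encoder prefers survives the trip.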

Tom.