Subject:	Shift-JIS NEC/IBM discussion
Date:	Mon, 26 Feb 2018 18:18:25 +0000
From:	Mark Zeren <mzeren@vmware.com>
To:	std-text-wg <std-text-wg@googlegroups.com>

Copied from Slack for safekeeping:

sdowney [1 hour ago]

Shift-JIS has a few hundred distinct character pairs that were unified into the same Unicode code points?

rmf [24 minutes ago]

That's only half correct. There are several problematic characters in the Japanese encoding standards, but this isn't an issue with Han unification.

rmf [22 minutes ago]

Those pairs are pairs *of the same character*, which happens to exist *twice* in common Shift-JIS codepages, like Microsoft's cp932.

rmf [20 minutes ago]

Those code pages encode the same character twice because of the way Shift-JIS extensions occurred. Almost all of the problematic characters were added by NEC and by IBM at separate Shift-JIS code points (edited)

rmf [19 minutes ago]

Because the NEC and IBM code points don't overlap, Microsoft's code page doubles as an IBM-compatible Shift-JIS and as a NEC-compatible Shift-JIS by mapping both.

rmf [11 minutes ago]

So yes, some Shift-JIS codepages, like cp932, don't roundtrip with naive processes, but:

rmf [10 minutes ago]

1. the problem is specific to the code pages and unrelated to Han unification

rmf [10 minutes ago]

2. the problem is actually irrelevant unless you're interacting with e.g. NEC-only or IBM-only systems

rmf [10 minutes ago]

And 3. Unicode has mechanisms to actually roundtrip this properly if you need it.

rmf [7 minutes ago]

(If you want an example: NEC encoded 纊 at Shift-JIS position 0xED40; IBM encoded it at Shift-JIS position 0xFA5C; Unicode has it at U+7E8A)
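
(A minimal sketch of that example, assuming Python's built-in "cp932" codec follows Microsoft's table and accepts both the NEC and the IBM position. The variable names are just for illustration. It decodes both byte pairs, then re-encodes the result to see whether the NEC bytes survive the trip.)

```python
# Byte pairs taken from the example above.
NEC_BYTES = bytes([0xED, 0x40])  # NEC position for the character
IBM_BYTES = bytes([0xFA, 0x5C])  # IBM position for the same character

# Assumption: the stdlib "cp932" codec maps both positions.
nec_char = NEC_BYTES.decode("cp932")
ibm_char = IBM_BYTES.decode("cp932")

# Both decode to one Unicode code point, so the distinction is lost here.
print(f"NEC -> U+{ord(nec_char):04X}, IBM -> U+{ord(ibm_char):04X}")

# Encoding has to pick one of the two positions, so a naive
# decode/encode cycle cannot preserve both inputs.
reencoded = nec_char.encode("cp932")
print("re-encoded as:", reencoded.hex())
print("NEC bytes roundtrip:", reencoded == NEC_BYTES)
print("IBM bytes roundtrip:", reencoded == IBM_BYTES)
```

(At most one of the last two lines can print True, which is the "doesn't roundtrip with naive processes" point: it only matters if the consumer insists on one specific position.)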