Preserved from the old std-text-wg@googlegroups.com mailing list.
| Subject: | Shift-JIS NEC/IBM discussion |
|---|---|
| Date: | Mon, 26 Feb 2018 18:18:25 +0000 |
| From: | Mark Zeren <mzeren@vmware.com> |
| To: | std-text-wg <std-text-wg@googlegroups.com> |
Copied from Slack for safekeeping:
sdowney [1 hour ago]
Shift-JIS has a few hundred distinct character pairs that were
unified into the same Unicode code points?
rmf [24 minutes ago]
That's only half correct. There are several problematic characters
in the Japanese encoding standards, but this isn't an issue with
Han unification.
rmf [22 minutes ago]
Those pairs are pairs *of the same character*, which happens to
exist *twice* in common Shift-JIS code pages, like Microsoft's
cp932.
rmf [20 minutes ago]
Those code pages encode the same character twice because of how
the Shift-JIS extensions came about. Almost all of the problematic
characters were added by NEC and by IBM at separate Shift-JIS code
points.
rmf [19 minutes ago]
Because the NEC and IBM positions don't overlap, Microsoft's code
page doubles as an IBM-compatible Shift-JIS and as a NEC-compatible
Shift-JIS by mapping both.
rmf [11 minutes ago]
So yes, some Shift-JIS code pages, like cp932, don't roundtrip with
naive processes, but:
rmf [10 minutes ago]
1. the problem is specific to the code pages and unrelated to Han
unification
rmf [10 minutes ago]
2. the problem is actually irrelevant unless you're interacting
with e.g. NEC-only or IBM-only systems
rmf [10 minutes ago]
And 3. Unicode has mechanisms to actually roundtrip this properly
if you need it.
rmf [7 minutes ago]
(If you want an example: NEC encoded 纊 at Shift-JIS position
0xED40; IBM encoded it at Shift-JIS position 0xFA5C; Unicode has it
at U+7E8A.)
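
The example above can be checked with a small sketch. It assumes that Python's built-in `cp932` codec follows Microsoft's code page closely enough to reproduce the double mapping. The names `NEC_BYTES`, `IBM_BYTES`, and `duplicate_groups` are made up for illustration and do not come from the discussion. The sketch decodes both byte sequences, re-encodes the result (the "naive process"), and then scans the two-byte space for every character that is reachable from more than one byte sequence.

```python
# Byte sequences and code point quoted in the chat above.
NEC_BYTES = bytes([0xED, 0x40])  # NEC position
IBM_BYTES = bytes([0xFA, 0x5C])  # IBM position
CODE_POINT = "\u7e8a"            # 纊

# Decode each position with Python's cp932 codec (assumed to match
# Microsoft's code page for these characters).
nec_char = NEC_BYTES.decode("cp932")
ibm_char = IBM_BYTES.decode("cp932")

# Naive round trip: decode, then encode again. Both inputs are now the
# same string, so the encoder can only return one byte sequence.
nec_roundtrip = nec_char.encode("cp932")
ibm_roundtrip = ibm_char.encode("cp932")


def duplicate_groups(codec="cp932"):
    """Map each character to its two-byte encodings, keeping only
    characters that can be decoded from more than one sequence."""
    by_char = {}
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid two-byte sequence
            if len(text) != 1:
                continue  # lead was a single-byte character
            by_char.setdefault(text, []).append(raw)
    return {ch: seqs for ch, seqs in by_char.items() if len(seqs) > 1}


if __name__ == "__main__":
    print(nec_char == CODE_POINT, ibm_char == CODE_POINT)
    print(nec_roundtrip.hex(), ibm_roundtrip.hex())
    print(len(duplicate_groups()), "characters with more than one encoding")
```

Comparing `nec_roundtrip` with `NEC_BYTES` and `ibm_roundtrip` with `IBM_BYTES` shows which of the two positions survives the round trip; at most one of them can. Which one the encoder prefers is a property of the codec and is deliberately not assumed here.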