Date: Mon, 15 Jun 2020 11:30:46 -0400
On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>
>
> On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer_at_[hidden]
> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>
> > On 15/06/2020 00.06, Hubert Tong wrote:
> > > The presence of a UCN for a C1 (non-EBCDIC) control character in
> > > a supposedly-EBCDIC string is not immediately indicative of an error.
> > > In this example, is the UCN intending to mean the conventionally
> > > mapped EBCDIC control character, or something else?
> >
> > Beyond EBCDIC control characters, do we know of any other situation
> > where input-to-Unicode mapping is lossy or not semantics-preserving?
> > It would be good to keep a list in one of the upcoming papers, for
> > the permanent record.
>
>
> There are 3 scenarios I can think of:
>
>   * The control characters for EBCDIC, but also for other encodings that
>     have more control characters than exist in ASCII; all of those
>     map to C0/C1 in an application-specific manner
>   * Some (~20) GB 18030 characters map to the Unicode private use
>     area, which also doesn't preserve semantics
>   * Some Big5 characters do not have a Unicode mapping at all (these
>     are exclusively place and people names and, for example, do not
>     concern the Windows Big5 code pages)
>
There are also the characters that have duplicate code point assignments
in Shift-JIS such that one of them won't round trip through Unicode. It
sounds like GB 18030 has one such case as well.
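
To make the round-trip problem concrete, here is a rough sketch that
enumerates double-byte sequences decoding to the same Unicode character
and then checks which of them survive re-encoding. It assumes Python's
"cp932" codec (Microsoft's Shift-JIS variant) as a stand-in for
Shift-JIS; the function names are made up for illustration, and other
converters may pick a different preferred sequence.

```python
from collections import defaultdict


def find_duplicate_mappings(codec="cp932"):
    """Map each character to its double-byte sources, keeping only duplicates.

    Assumption: the codec uses Shift-JIS style lead bytes (0x81..0xFC)
    and trail bytes (0x40..0xFC).
    """
    sources = defaultdict(list)
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid double-byte sequence
            # Two characters means the bytes decoded as two single-byte
            # characters (e.g. halfwidth katakana), so skip those.
            if len(text) == 1:
                sources[text].append(raw)
    return {ch: seqs for ch, seqs in sources.items() if len(seqs) > 1}


def lossy_sequences(duplicates, codec="cp932"):
    """Return the byte sequences that do not survive decode -> encode."""
    lost = []
    for ch, seqs in duplicates.items():
        try:
            preferred = ch.encode(codec)
        except UnicodeEncodeError:
            preferred = None  # no sequence round-trips at all
        lost.extend(seq for seq in seqs if seq != preferred)
    return lost


if __name__ == "__main__":
    dups = find_duplicate_mappings()
    lost = lossy_sequences(dups)
    print(f"{len(dups)} characters have duplicate assignments")
    print(f"{len(lost)} byte sequences do not round-trip")
```

Since encoding yields a single byte sequence per character, at most one
of the duplicate sources can come back unchanged; the rest are silently
normalized to the preferred one.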
Tom.
Received on 2020-06-15 10:33:57