Date: Mon, 15 Jun 2020 17:47:59 +0200
On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom_at_[hidden]> wrote:
> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 15/06/2020 00.06, Hubert Tong wrote:
>> > The presence of a UCN for a C1 (non-EBCDIC) control character in a
>> supposedly-EBCDIC string is not immediately indicative of an error.
>> In this example, is the UCN intending to mean the conventionally mapped
>> EBCDIC control character, or something else?
>>
>> Beyond EBCDIC control characters, do we know of any other situation
>> where input-to-Unicode mapping is not semantics-preserving or lossy?
>> It would be good to keep a list in one of the upcoming papers, for
>> the permanent record.
>>
>
> There are 3 scenarios I can think of:
>
> - The control characters of EBCDIC, and of other encodings that have
> more control characters than ASCII does; all of these map to C0/C1 in
> an application-specific manner.
> - Some (~20) GB 18030 characters map to the Unicode private use area,
> which also doesn't "preserve semantics".
> - Some Big5 characters have no Unicode mapping at all (these are
> exclusively place and personal names and, for example, do not affect the
> Windows Big5 code pages).
>
> There are also the characters that have duplicate code point assignments
> in Shift-JIS such that one of them won't round trip through Unicode. It
> sounds like GB 18030 has one such case as well.
>
Semantics-preserving is different from round-trippable.
Consider a source character S1, its internal representation U, and C1, C2,
two possible representations of that character in the execution encoding.
The following two mappings are both valid, and both preserve the semantics
across phases 1 and 5:
S1 -> U -> C1
S1 -> U -> C2
Which mapping was chosen isn't observable from within the program, so an
implementation could choose to prefer the mapping that happens to have the
same byte value as in the source.
That behavior should, imo, neither be prescribed nor prevented.
While the _wording_ loses information about the source encoding after phase
1, that doesn't mean an implementation has to pretend it doesn't
have perfect information when considering this scenario (but prescribing
that behavior would severely reduce implementation freedom and wouldn't
match existing practice, which we should avoid).
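
To make the S1 -> U -> C1/C2 point concrete, here is a minimal Python sketch.
The tables, the byte values and the name to_execution are all invented for
illustration; they are not taken from any real code page or compiler, and
the sketch only assumes that two source byte values share one Unicode code
point and that the execution encoding accepts either byte value for it.

```python
# Hypothetical toy tables: byte values are invented, not from a real code page.
# Phase 1: two distinct source byte values denote the same character U.
SOURCE_TO_UNICODE = {0xA1: "\u00AC", 0xB2: "\u00AC"}
# Phase 5: the execution encoding has two valid representations of U (C1, C2).
UNICODE_TO_EXECUTION = {"\u00AC": (0xA1, 0xB2)}


def to_execution(source_byte, prefer_source_value=True):
    """Map a source byte to an execution byte through Unicode."""
    u = SOURCE_TO_UNICODE[source_byte]      # S1 -> U
    candidates = UNICODE_TO_EXECUTION[u]    # U -> {C1, C2}
    if prefer_source_value and source_byte in candidates:
        return source_byte                  # keep the byte value seen in the source
    return candidates[0]                    # otherwise use the first listed mapping


for s in (0xA1, 0xB2):
    print(hex(s), "->", hex(to_execution(s)), "or", hex(to_execution(s, False)))
```

Either result encodes the same character, so both choices preserve the
semantics; only the choice that ignores the source byte value gives up the
round trip.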
> Tom.
>
Received on 2020-06-15 10:51:20