Date: Mon, 15 Jun 2020 12:06:53 -0400
On 6/15/20 11:47 AM, Corentin Jabot wrote:
>
>
> On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>>
>>
>> On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer_at_[hidden]
>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>
>> On 15/06/2020 00.06, Hubert Tong wrote:
>> > The presence of a UCN for a C1 (non-EBCDIC) control
>> > character in a supposedly-EBCDIC string is not immediately
>> > indicative of an error.
>> In this example, is the UCN intended to mean the
>> conventionally mapped EBCDIC control character, or something else?
>>
>> Beyond EBCDIC control characters, do we know of any other
>> situation
>> where input-to-Unicode mapping is not semantics-preserving or
>> lossy?
>> It would be good to keep a list in one of the upcoming
>> papers, for
>> the permanent record.
>>
>>
>> There are 3 scenarios I can think of:
>>
>> * The control characters for EBCDIC, and also for other encodings
>> that have more control characters than ASCII does; all of those
>> map to C0/C1 in an application-specific manner
>> * Some (~20) GB 18030 characters map to the Unicode
>> private use area, which also doesn't "preserve semantics"
>> * Some Big5 characters do not have a Unicode mapping at all
>> (these are exclusively place and people names and, for example,
>> do not affect the Windows Big5 code pages)
>>
> There are also the characters that have duplicate code point
> assignments in Shift-JIS such that one of them won't round trip
> through Unicode. It sounds like GB 18030 has one such case as well.
>
> Semantics-preserving is different from round-trippable.
Absolutely; we know of cases where semantics are preserved and
round-tripping is not, and of different cases where round-tripping is
preserved but semantics are not.
>
> Consider a source character S1, U its internal representation, and C1,
> C2 two possible representations of that character in the execution
> encoding. The following two mappings are valid, and both preserve the
> semantics in phases 1 and 5.
>
> S1 -> U -> C1
> S1 -> U -> C2
>
> It isn't observable from within the program which mapping was chosen
> and therefore an implementation could choose to prefer
> the mapping that happens to have the same byte value as in source.
The Shift-JIS case is where characters S1 and S2 both map to the same C
in U:
S1 -> U -> C
S2 -> U -> C
That difference is (expected to be) observable in raw string literals.
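
A minimal sketch of the many-to-one case above, assuming a toy mapping table. The byte values, the code point, and the helper names `through_unicode` and `round_trips` are all hypothetical and are not taken from any real Shift-JIS table; the point is only to show two distinct source encodings collapsing to one code point, so that one of them cannot round-trip.

```python
# Toy model of the S1 -> U -> C and S2 -> U -> C case.
# All byte values and the code point below are hypothetical placeholders,
# not entries from a real Shift-JIS mapping table.
S1 = b"\x81\x01"   # first source encoding of the character (hypothetical)
S2 = b"\xee\x01"   # duplicate source encoding of the same character (hypothetical)
U = 0x1234         # the single Unicode code point both map to (hypothetical)

SOURCE_TO_UNICODE = {S1: U, S2: U}   # phase 1: many-to-one
UNICODE_TO_EXECUTION = {U: S1}       # phase 5: must pick one encoding


def through_unicode(source_bytes: bytes) -> bytes:
    """Map source bytes to Unicode and back to the execution encoding."""
    return UNICODE_TO_EXECUTION[SOURCE_TO_UNICODE[source_bytes]]


def round_trips(source_bytes: bytes) -> bool:
    """Report whether the bytes survive the trip through Unicode unchanged."""
    return through_unicode(source_bytes) == source_bytes
```

In this sketch `round_trips(S1)` holds and `round_trips(S2)` does not, even though both inputs denote the same character: the semantics survive while the original bytes do not.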
> That behavior should, imo, neither be prescribed nor prevented.
>
> While the _wording_ loses information about the source encoding after
> phase 1, it doesn't mean that an implementation has to pretend it doesn't
> have perfect information when considering this scenario (but
> prescribing it would severely reduce implementation freedom and
> wouldn't match existing practices, which we should avoid).
I believe we agree here. The problem is that the wording prevents
discussing the scenario in formal terms.
Tom.
Received on 2020-06-15 11:11:01