Date: Mon, 15 Jun 2020 12:54:36 -0400
On 6/15/20 12:20 PM, Corentin Jabot wrote:
>
>
> On Mon, 15 Jun 2020 at 18:17, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/15/20 12:11 PM, Corentin Jabot wrote:
>>
>>
>> On Mon, 15 Jun 2020 at 18:06, Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 6/15/20 11:47 AM, Corentin Jabot wrote:
>>>
>>>
>>> On Mon, 15 Jun 2020 at 17:30, Tom Honermann
>>> <tom_at_[hidden] <mailto:tom_at_[hidden]>> wrote:
>>>
>>> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>>>>
>>>>
>>>> On Mon, 15 Jun 2020 at 09:00, Jens Maurer
>>>> <Jens.Maurer_at_[hidden] <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>>>
>>>> On 15/06/2020 00.06, Hubert Tong wrote:
>>>> > The presence of a UCN for a C1 (non-EBCDIC)
>>>> control character in a supposedly-EBCDIC string is
>>>> not immediately indicative of an error.
>>>> In this example, is the UCN intending to mean the
>>>> conventionally mapped
>>>> EBCDIC control character, or something else?
>>>>
>>>> Beyond EBCDIC control characters, do we know of any
>>>> other situation
>>>> where input-to-Unicode mapping is not
>>>> semantics-preserving or lossy?
>>>> It would be good to keep a list in one of the
>>>> upcoming papers, for
>>>> the permanent record.
>>>>
>>>>
>>>> There are 3 scenarios I can think of:
>>>>
>>>> * The control characters for EBCDIC, but also other
>>>> encodings that have more control characters beyond
>>>> what exists in ASCII; all of these map to C0/C1 in
>>>> an application-specific manner
>>>> * Some (~20) GB 18030 characters map to the Unicode
>>>> private use area, which also doesn't "preserve semantics"
>>>> * Some Big5 characters do not have a Unicode mapping
>>>> at all (these are exclusively place and people
>>>> names, and this doesn't concern, for example, the Windows
>>>> Big5 code pages)
>>>>
>>> There are also the characters that have duplicate code
>>> point assignments in Shift-JIS such that one of them
>>> won't round trip through Unicode. It sounds like GB
>>> 18030 has one such case as well.
>>>
>>> Semantics-preserving is different from round-trippable.
>> Absolutely; we know of cases where semantics are preserved
>> and round-tripping is not, and of different cases where
>> round-tripping is preserved but semantics are not.
>>>
>>> Consider a source character S1, U its internal
>>> representation, and C1, C2 two possible representations of that
>>> character in the execution encoding.
>>> The two following mappings are valid, and preserve the
>>> semantics in phases 1 and 5.
>>>
>>> S1 -> U -> C1
>>> S1 -> U -> C2
>>>
>>> It isn't observable from within the program which mapping
>>> was chosen and therefore an implementation could choose to
>>> prefer
>>> the mapping that happens to have the same byte value as in
>>> source.
>>
>> The Shift-JIS case is where characters S1 and S2 both map to
>> the same U (and hence to the same C):
>>
>> S1 -> U -> C
>> S2 -> U -> C
>>
>> That difference is (expected to be) observable in raw string
>> literals.
>>
>>> That behavior should, imo, neither be prescribed nor prevented.
>>>
>>> While the _wording_ loses information about the source
>>> encoding after phase 1, it doesn't mean that an
>>> implementation has to pretend it doesn't
>>> have perfect information when considering this scenario (but
>>> prescribing it would severely reduce implementation freedom
>>> and wouldn't match existing practices, which we should avoid).
>>
>> I believe we agree here. The problem is that the wording
>> prevents discussing the scenario in formal terms.
>>
>>
>> What would be the benefit of discussing that in the wording?
>
> Avoiding long email threads and confusion about what the heck the
> standard is specifying :)
>
> Concretely you would want a note that when multiple mappings are
> possible, it is implementation defined which is chosen?
No, I want a mechanism that can carry implementation-defined source
input information through phase 1 to use in phase 3 (in an
implementation-defined manner).
Tom.
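
To make the Shift-JIS-style scenario concrete, here is a minimal sketch of the kind of mechanism meant above. Everything in it is an assumption for illustration: the byte values, the toy tables DECODE and ENCODE, and the function names are hypothetical and do not correspond to any real encoding or compiler. It only shows that when two source bytes map to the same code point, a plain phase 1 conversion cannot restore the original byte, while carrying the source byte alongside the code point lets a later phase recover it for a raw string literal.

```python
# Hypothetical toy encoding; byte values are made up, not real Shift-JIS.
# Two source bytes (S1 = 0x81, S2 = 0xEE) decode to the same code point U.
DECODE = {0x41: "A", 0x81: "\u00AC", 0xEE: "\u00AC"}
# The reverse table can only pick one byte for U (here the S1 value).
ENCODE = {"A": 0x41, "\u00AC": 0x81}


def phase1_plain(src):
    """Map source bytes to code points, discarding the original bytes."""
    return [DECODE[b] for b in src]


def phase1_tagged(src):
    """Map source bytes to (code point, original byte) pairs."""
    return [(DECODE[b], b) for b in src]


def to_execution(chars):
    """Encode code points using the single preferred reverse mapping."""
    return bytes(ENCODE[c] for c in chars)


def raw_literal_bytes(tagged):
    """Recover the source bytes from the carried information."""
    return bytes(b for _, b in tagged)


source = bytes([0x41, 0xEE])  # 'A' followed by S2
plain = to_execution(phase1_plain(source))
tagged = raw_literal_bytes(phase1_tagged(source))
```

In this sketch `plain` ends with the S1 byte rather than the S2 byte that was in the source, whereas `tagged` reproduces the source bytes exactly. How an implementation would actually carry such information is left implementation-defined, as described above.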
Received on 2020-06-15 11:57:48