sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 18:20:10 +0200

On Mon, 15 Jun 2020 at 18:17, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/15/20 12:11 PM, Corentin Jabot wrote:
>
>
>
> On Mon, 15 Jun 2020 at 18:06, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/15/20 11:47 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>>
>>> On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>> On 15/06/2020 00.06, Hubert Tong wrote:
>>>> > The presence of a UCN for a C1 (non-EBCDIC) control character in a
>>>> supposedly-EBCDIC string is not immediately indicative of an error.
>>>> In this example, is the UCN intending to mean the conventionally mapped
>>>> EBCDIC control character, or something else?
>>>>
>>>> Beyond EBCDIC control characters, do we know of any other situation
>>>> where input-to-Unicode mapping is not semantics-preserving or lossy?
>>>> It would be good to keep a list in one of the upcoming papers, for
>>>> the permanent record.
>>>>
>>>
>>> There are 3 scenarios I can think of:
>>>
>>> - The control characters for EBCDIC, but also for other encodings that
>>> have more control characters than exist in ASCII; all of these map
>>> to C0/C1 in an application-specific manner
>>> - Some (~20) GB 18030 characters map to the Unicode private use
>>> area, which also doesn't "preserve semantics"
>>> - Some Big5 characters do not have a Unicode mapping at all (these
>>> are exclusively place and people names, and this doesn't concern, for
>>> example, the Windows Big5 code pages)
>>>
>>> There are also the characters that have duplicate code point assignments
>>> in Shift-JIS such that one of them won't round trip through Unicode. It
>>> sounds like GB 18030 has one such case as well.
>>>
>> Semantics-preserving is different from round-trippable.
>>
>> Absolutely; we know of cases where semantics are preserved and
>> round-tripping is not, and of different cases where round-tripping is
>> preserved but semantics are not.
>>
>>
>> Consider a source character S1, U its internal representation, and C1, C2
>> two possible representations of that character in the execution encoding.
>> The following two mappings are both valid, and both preserve the semantics
>> in phases 1 and 5.
>>
>> S1 -> U -> C1
>> S1 -> U -> C2
>>
>> It isn't observable from within the program which mapping was chosen and
>> therefore an implementation could choose to prefer
>> the mapping that happens to have the same byte value as in source.
>>
>> The Shift-JIS case is where characters S1 and S2 both map to the same U,
>> and therefore to the same C:
>>
>> S1 -> U -> C
>> S2 -> U -> C
>>
>> That difference is (expected to be) observable in raw string literals.
>>
>> That behavior should, imo, neither be prescribed nor prevented.
>>
>>
>> While the _wording_ loses information about the source encoding after
>> phase 1, it doesn't mean that an implementation has to pretend it doesn't
>> have perfect information when considering this scenario (but prescribing
>> it would severely reduce implementation freedom and wouldn't match existing
>> practices, which we should avoid).
>>
>> I believe we agree here. The problem is that the wording prevents
>> discussing the scenario in formal terms.
>>
>
> What would be the benefit of discussing that in the wording?
>
> Avoiding long email threads and confusion about what the heck the standard
> is specifying :)
>
Concretely, you would want a note saying that, when multiple mappings are
possible, it is implementation-defined which one is chosen?
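
To make the question concrete, here is a minimal sketch of the two cases
discussed above. Everything in it is hypothetical: the table, the `Mapping`
type, and the `decode`, `encode_first` and `encode_prefer_source` functions are
made up for illustration, and the code unit values are only meant to resemble
a Shift-JIS style duplicate (two code units for U+2252), not to be an
authoritative mapping table. It is not how any particular compiler does it.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical legacy encoding in which two code units (C1, C2) denote the
// same abstract character. Values are illustrative only.
struct Mapping {
    std::uint16_t legacy;   // code unit in the source/execution encoding
    char32_t      unicode;  // internal representation (U)
};

constexpr std::array<Mapping, 3> table{{
    {0x81E0, char32_t{0x2252}},  // C1
    {0x8790, char32_t{0x2252}},  // C2, duplicate of C1
    {0x0041, char32_t{0x0041}},  // an unambiguous character
}};

// Phase 1 direction: source code unit -> U.
// Both duplicates decode to the same U, so the distinction is lost here.
constexpr std::optional<char32_t> decode(std::uint16_t legacy) {
    for (const Mapping& m : table)
        if (m.legacy == legacy) return m.unicode;
    return std::nullopt;
}

// Phase 5 direction, policy A: always pick the first table entry for U.
constexpr std::optional<std::uint16_t> encode_first(char32_t u) {
    for (const Mapping& m : table)
        if (m.unicode == u) return m.legacy;
    return std::nullopt;
}

// Phase 5 direction, policy B: prefer the code unit that appeared in the
// source when it is a valid encoding of U; otherwise fall back to policy A.
constexpr std::optional<std::uint16_t>
encode_prefer_source(char32_t u, std::uint16_t source_unit) {
    if (decode(source_unit) == u) return source_unit;
    return encode_first(u);
}

// S1 -> U and S2 -> U: both source code units yield the same U.
static_assert(decode(0x81E0) == decode(0x8790));
// U -> C1 under policy A, regardless of what the source contained.
static_assert(encode_first(char32_t{0x2252}) == 0x81E0);
// U -> C2 under policy B when the source used C2, so the bytes round-trip.
static_assert(encode_prefer_source(char32_t{0x2252}, 0x8790) == 0x8790);
```

Both policies produce a valid encoding of U, so both preserve semantics; only
policy B also preserves the source bytes. A note in the wording would
presumably only say that the choice between such policies is up to the
implementation.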

> Tom.
>

Received on 2020-06-15 11:23:31