sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 18:11:07 +0200

On Mon, 15 Jun 2020 at 18:06, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/15/20 11:47 AM, Corentin Jabot wrote:
>
>
>
> On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>>> On 15/06/2020 00.06, Hubert Tong wrote:
>>> > The presence of a UCN for a C1 (non-EBCDIC) control character in a
>>> supposedly-EBCDIC string is not immediately indicative of an error.
>>> In this example, is the UCN intending to mean the conventionally mapped
>>> EBCDIC control character, or something else?
>>>
>>> Beyond EBCDIC control characters, do we know of any other situation
>>> where input-to-Unicode mapping is not semantics-preserving or lossy?
>>> It would be good to keep a list in one of the upcoming papers, for
>>> the permanent record.
>>>
>>
>> There are 3 scenarios I can think of:
>>
>> - The control characters for EBCDIC, but also other encodings that
>> have more control characters beyond what exists in ASCII; all of these map
>> to C0/C1 in an application-specific manner
>> - Some (~20) GB 18030 characters map to the Unicode private use area,
>> which also doesn't "preserve semantics"
>> - Some Big5 characters do not have a Unicode mapping at all (these are
>> exclusively place and people names, and this does not, for example,
>> concern the Windows Big5 code pages)
>>
>> There are also the characters that have duplicate code point assignments
>> in Shift-JIS such that one of them won't round trip through Unicode. It
>> sounds like GB 18030 has one such case as well.
>>
> Semantics-preserving is different from round-trippable.
>
> Absolutely; we know of cases where semantics are preserved and
> round-tripping is not, and of different cases where round-tripping is
> preserved but semantics are not.
>
>
> Consider a source character S1, its internal representation U, and C1 and
> C2, two possible representations of that character in the execution encoding.
> The following two mappings are both valid, and both preserve semantics in
> phases 1 and 5.
>
> S1 -> U -> C1
> S1 -> U -> C2
>
> It isn't observable from within the program which mapping was chosen and
> therefore an implementation could choose to prefer
> the mapping that happens to have the same byte value as in source.
>
> The Shift-JIS case is where characters S1 and S2 both map to the same U,
> and therefore to the same C:
>
> S1 -> U -> C
> S2 -> U -> C
>
> That difference is (expected to be) observable in raw string literals.
>
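
The Shift-JIS case above can be sketched with a toy table. This is only an illustration under stated assumptions: the values chosen for S1, S2 and U, and the helper names to_unicode, to_execution and round_trips, are hypothetical placeholders and not real Shift-JIS data or any implementation's API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Toy values only (assumption): placeholders, not real Shift-JIS code units.
constexpr std::uint16_t S1 = 0x0001;
constexpr std::uint16_t S2 = 0x0002;
constexpr char32_t U = 0xE000;  // arbitrary placeholder code point

// Phase 1 direction: source encoding -> Unicode.
// Two distinct source values share a single code point.
inline std::optional<char32_t> to_unicode(std::uint16_t s) {
    static const std::map<std::uint16_t, char32_t> table = {{S1, U}, {S2, U}};
    const auto it = table.find(s);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// Phase 5 direction: Unicode -> execution encoding.
// A plain table can name only one result per code point; here it picks S1.
inline std::optional<std::uint16_t> to_execution(char32_t u) {
    static const std::map<char32_t, std::uint16_t> table = {{U, S1}};
    const auto it = table.find(u);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// True when a source value survives source -> Unicode -> execution unchanged.
inline bool round_trips(std::uint16_t s) {
    const auto u = to_unicode(s);
    if (!u.has_value()) return false;
    const auto c = to_execution(*u);
    return c.has_value() && *c == s;
}
```

With these tables, S1 round-trips and S2 does not, even though both denote the same character. An implementation that remembers the original source bytes could still emit S2 unchanged, which is the freedom discussed in the next paragraphs.
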
> That behavior should, imo, neither be prescribed nor prevented.
>
>
> While the _wording_ loses information about the source encoding after
> phase 1, it doesn't mean that an implementation has to pretend it doesn't
> have perfect information when considering this scenario (but prescribing
> it would severely reduce implementation freedom and wouldn't match existing
> practices, which we should avoid).
>
> I believe we agree here. The problem is that the wording prevents
> discussing the scenario in formal terms.
>

What would be the benefit of discussing that in the wording?

> Tom.
>

Received on 2020-06-15 11:14:28