Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 15 Jun 2020 12:17:46 -0400
On 6/15/20 12:11 PM, Corentin Jabot wrote:
>
>
> On Mon, 15 Jun 2020 at 18:06, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/15/20 11:47 AM, Corentin Jabot wrote:
>>
>>
>> On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom_at_[hidden]> wrote:
>>
>> On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>> On Mon, 15 Jun 2020 at 09:00, Jens Maurer
>>> <Jens.Maurer_at_[hidden]> wrote:
>>>
>>> On 15/06/2020 00.06, Hubert Tong wrote:
>>> > The presence of a UCN for a C1 (non-EBCDIC) control
>>> > character in a supposedly-EBCDIC string is not
>>> > immediately indicative of an error.
>>>
>>> In this example, is the UCN intended to mean the
>>> conventionally mapped EBCDIC control character, or
>>> something else?
>>>
>>> Beyond EBCDIC control characters, do we know of any other
>>> situation where input-to-Unicode mapping is lossy or not
>>> semantics-preserving? It would be good to keep a list in
>>> one of the upcoming papers, for the permanent record.
>>>
>>>
>>> There are 3 scenarios I can think of:
>>>
>>> * The control characters for EBCDIC, but also other
>>> encodings that have more control characters beyond what
>>> exists in ASCII; all of those map to C0/C1 in an
>>> application-specific manner.
>>> * Some (~20) GB 18030 characters map to the Unicode
>>> private use area, which also doesn't preserve semantics.
>>> * Some Big5 characters do not have a Unicode mapping at
>>> all (these are exclusively place and people names, and
>>> this doesn't concern, for example, the Windows Big5 code pages).
>>>
>> There are also the characters that have duplicate code point
>> assignments in Shift-JIS such that one of them won't round
>> trip through Unicode. It sounds like GB 18030 has one such
>> case as well.
>>
>> Semantics-preserving is different from round-trippable.
> Absolutely; we know of cases where semantics are preserved and
> round-tripping is not, and of different cases where round-tripping
> is preserved but semantics are not.
>>
>> Consider a source character S1, U its internal representation, and
>> C1, C2 two possible representations of that character in the
>> execution encoding.
>> The following two mappings are valid, and preserve the semantics in
>> phases 1 and 5.
>>
>> S1 -> U -> C1
>> S1 -> U -> C2
>>
>> It isn't observable from within the program which mapping was
>> chosen and therefore an implementation could choose to prefer
>> the mapping that happens to have the same byte value as in source.
>
> The Shift-JIS case is where characters S1 and S2 both map to the
> same C in U:
>
> S1 -> U -> C
> S2 -> U -> C
>
> That difference is (expected to be) observable in raw string literals.
>
>> That behavior should, imo, neither be prescribed nor prevented.
>>
>> While the _wording_ loses information about the source encoding
>> after phase 1, it doesn't mean that an implementation has to
>> pretend it doesn't have perfect information when considering this
>> scenario (but prescribing it would severely reduce implementation
>> freedom and wouldn't match existing practices, which we should avoid).
>
> I believe we agree here. The problem is that the wording prevents
> discussing the scenario in formal terms.
>
>
> What would be the benefit of discussing that in the wording?

Avoiding long email threads and confusion about what the heck the
standard is specifying :)

Tom.
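
For readers following the thread, the many-to-one case quoted above (S1 -> U -> C and S2 -> U -> C) can be sketched as below. This is only an illustration and not taken from the discussion: the values of S1, S2, and U are hypothetical placeholders rather than real Shift-JIS or GB 18030 data, and decode, encode, and round_trips are made-up helper names. It assumes the reverse mapping keeps a single preferred source character per code point, which is why one of the two duplicates cannot round-trip.

```cpp
#include <cstdint>
#include <map>
#include <optional>

using SourceUnit = std::uint16_t;  // one source-encoding character (hypothetical width)
using CodePoint  = char32_t;       // the internal Unicode representation "U"

// Hypothetical placeholder values: S1 and S2 are distinct source
// characters that both decode to the same code point U.
constexpr SourceUnit S1 = 0x1001;
constexpr SourceUnit S2 = 0x2002;
constexpr CodePoint  U  = 0x2252;

// Source encoding -> Unicode (many-to-one).
std::optional<CodePoint> decode(SourceUnit s) {
    static const std::map<SourceUnit, CodePoint> table = {{S1, U}, {S2, U}};
    auto it = table.find(s);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// Unicode -> source encoding. Only one preferred source character
// can be stored per code point; this sketch assumes S1 is preferred.
std::optional<SourceUnit> encode(CodePoint u) {
    static const std::map<CodePoint, SourceUnit> table = {{U, S1}};
    auto it = table.find(u);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// True if s survives source -> Unicode -> source unchanged.
bool round_trips(SourceUnit s) {
    auto u = decode(s);
    if (!u) return false;
    auto back = encode(*u);
    return back && *back == s;
}
```

Under these assumptions, round_trips(S1) holds and round_trips(S2) does not, even though both characters decode to the same code point. An implementation that wants S2 to survive (for example in a raw string literal) has to remember more than U alone.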


Received on 2020-06-15 11:21:25