sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 14 Jun 2020 17:46:14 -0400

On 6/14/20 3:55 PM, Jens Maurer wrote:
> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
>
>>>> I don't think we should entertain any notion of "same character" in C++,
>>>> beyond value comparisons in the execution encoding and "identity" as
>>>> needed for "same identifier".
>>>>
>>>>
>>>> We need to in/before phase 1, but I think we reached the consensus that we otherwise
>>>> shouldn't and wouldn't
>>> I'm not sure we need to in phase 1 either. The only cases would be for conversion from source file characters that have multiple representations for the same semantic character, or (arguably) for Unicode normalization (which I believe we have consensus should not be performed in translation phase 1; in other words, EGCs are not "characters" for the purposes of translation phase 1).
>>>
>>>
>>> In phase 1 we need _something_.
>>> Abstract character (which is exactly what the standard calls "Physical Character") lets us talk about the picture of the code case.
>> In phase 1, we need the concept of identity in order to map the source input to the basic source character set + UCNs. I think Jens was arguing more that we do not need (and should not need) the concept of equivalence.
>>
>>
>> Sure (as long as we accept that one abstract character may map to a sequence of code points (or UCNs)).
> No, each code point in a sequence (given Unicode input) is a separate abstract character
> in my view (after combining surrogate pairs, of course).

Given Unicode input, yes. But we know of characters that are
represented by a single code point in other character sets but that
require multiple code points to be represented in Unicode. One such
example can be found in Big5-HKSCS, where the double-byte sequences
"\x88\x62" and "\x88\xA5" map to { U+00CA U+0304 } (Ê̄) and { U+00EA
U+030C } (ê̌) respectively. I think this is the kind of example that
Corentin was referring to.
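
The mapping above can be checked with a short script. This is only a
minimal sketch, assuming Python 3.9 or later and its standard
"big5hkscs" codec; the helper name code_points is made up for
illustration, and other converters (iconv, ICU) may use different
mapping tables. It decodes each double-byte sequence and lists the
resulting Unicode code points, then applies NFC to show that
normalization does not collapse the pair into one code point.

```python
import unicodedata

# The two double-byte sequences quoted in the message.
SAMPLES = [b"\x88\x62", b"\x88\xA5"]


def code_points(raw: bytes) -> list[str]:
    """Decode raw as Big5-HKSCS and return its code points as U+XXXX strings.

    Hypothetical helper for this illustration; it relies on Python's
    built-in "big5hkscs" codec.
    """
    text = raw.decode("big5hkscs")
    return [f"U+{ord(ch):04X}" for ch in text]


for raw in SAMPLES:
    text = raw.decode("big5hkscs")
    # NFC composes where a precomposed form exists; the length shows
    # whether the sequence still needs more than one code point.
    nfc = unicodedata.normalize("NFC", text)
    print(raw.hex(), code_points(raw), "NFC length:", len(nfc))
```

If the codec behaves as assumed, each two-byte input decodes to two
code points, which is the case where one character in the source
encoding has no single-code-point counterpart in Unicode.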

Tom.

Received on 2020-06-14 16:49:26