sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 14 Jun 2020 15:16:51 -0400

On 6/14/20 1:39 PM, Corentin Jabot wrote:
>
>
> On Sun, 14 Jun 2020 at 19:21, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/14/20 5:03 AM, Corentin Jabot via SG16 wrote:
>>
>>
>> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> On 11/06/2020 00.06, Hubert Tong wrote:
>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer
>> <Jens.Maurer_at_[hidden]> wrote:
>> >
>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>> > > I agree with Corentin's point that the strict use of
>> abstract characters introduces problems where a coded
>> character set contains multiple values for a single abstract
>> character/contains characters that are canonically the same
>> but assigned different values.
>> >
>> > I have a hard time imagining such a thing. Can you
>> give an example?
>> >
>> > Yes, U+FA9A as described in
>> https://en.wikipedia.org/wiki/Han_unification has this
>> situation with U+6F22.
>> > These characters are distinct as members of a coded
>> character set, but as abstract characters, I do not believe
>> we can easily say the same.
>>
>> I would expect these to be two different abstract characters
>> in the C++ sense.
>> Roughly, anything you can distinguish in the source should be
>> a different
>> "abstract character", if only for the benefit of raw string
>> literals.
>>
>>
>> "abstract character" is a notion for humans and to talk about
>> mapping from one set to another.
>> After phase 1, C++ deals with code points such that two sequences
>> of code points are identical if and only if they have the same values
>
> I think Jens' point stands; two different abstract characters in
> the source should be differentiated post translation phase 1.
>
> Yes? But letters with different colors are not different abstract
> characters. Let's let experts decide what constitutes an abstract
> character.

I agree, but per other messages in this and other threads, experts
haven't fully defined mappings between character sets that preserve
semantics, and we seem to be aware of implementations that are
impacted.

> Post translation phase 1, all we have are basic source characters
> and UCNs (with magical revert of UCNs for raw string literals);
> not code points.
>
> UCNs _are_ code points (a sequence of UCNs carries exactly as much
> information as a sequence of code points)

A UCN is, for example, the character sequence '\', 'u', <hex-digit>,
<hex-digit>, <hex-digit>, <hex-digit>. The fact that it has a trivial
mapping to a code point doesn't make it a code point. I think we've
discussed the distinction in several contexts now.

The raw literal magic reversion suggests to me that, post phase 1,
something more is needed than just basic source characters + UCNs or
just code points.
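
A minimal sketch of that distinction, assuming a C++17 compiler; the
helper names (ucn_length, raw_length, ucn_value, raw_first) are
hypothetical and only serve the illustration. In a non-raw literal the
six source characters of the UCN end up as one element, while the raw
literal keeps them as six separate elements:

```cpp
#include <cstddef>
#include <string_view>

// Non-raw literal: \u6F22 is a universal-character-name, so the
// resulting array holds a single char32_t element (plus terminator).
constexpr std::size_t ucn_length() {
    return std::u32string_view(U"\u6F22").size();
}

// Raw literal: the same six source characters are kept as-is, so the
// array holds '\\', 'u', '6', 'F', '2', '2' (plus terminator).
constexpr std::size_t raw_length() {
    return std::u32string_view(UR"(\u6F22)").size();
}

// First element of each literal, for comparison.
constexpr char32_t ucn_value() { return U"\u6F22"[0]; }
constexpr char32_t raw_first() { return UR"(\u6F22)"[0]; }

static_assert(ucn_length() == 1);
static_assert(raw_length() == 6);
static_assert(ucn_value() == 0x6F22);
static_assert(raw_first() == U'\\');
```

The static_asserts are checked at compile time, so a translation unit
containing only this block is enough to confirm the behavior on a given
implementation.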

>
>>
>> I don't think we should entertain any notion of "same
>> character" in C++,
>> beyond value comparisons in the execution encoding and
>> "identity" as
>> needed for "same identifier".
>>
>>
>> We need to in/before phase 1, but I think we reached the
>> consensus that we otherwise
>> shouldn't and wouldn't
> I'm not sure we need to in phase 1 either. The only cases would
> be for conversion from source file characters that have multiple
> representations for the same semantic character, or (arguably) for
> Unicode normalization (which I believe we have consensus should
> not be performed in translation phase 1; in other words, EGCs are
> not "characters" for the purposes of translation phase 1).
>
>
> In phase 1 we need _something_
> Abstract character (which is exactly what the standard calls
> "Physical Character") lets us talk about the picture of the code case.
In phase 1, we need the concept of identity in order to map the source
input to the basic source character set + UCNs. I think Jens was
arguing more that we do not need (and should not need) the concept of
equivalence.
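
To make identity versus equivalence concrete, here is a small sketch
using Hubert's example, again assuming C++17; same_identity is a
hypothetical helper, not anything proposed for the standard. U+FA9A has
a canonical decomposition to U+6F22 in Unicode, yet a plain value
comparison, which is all identity requires, keeps the two apart:

```cpp
#include <string_view>

// Identity in the sense used above: two sequences are the same only
// if they have the same length and the same element values. No
// normalization or other equivalence mapping is applied.
constexpr bool same_identity(std::u32string_view a,
                             std::u32string_view b) {
    return a == b;
}

// Canonically equivalent in Unicode, but distinct code points.
static_assert(!same_identity(U"\uFA9A", U"\u6F22"));

// The same code point is, of course, identical to itself.
static_assert(same_identity(U"\u6F22", U"\u6F22"));
```

Deciding that the first pair is "the same" would require Unicode
normalization data, which is the kind of equivalence this model avoids.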
>
>>
>> For example, if some hypothetical input format differentiates
>> red and
>> green letters that are otherwise "the same", I'd still expect
>> a red A
>> to be a different abstract character than a green A. (Ok,
>> that doesn't
>> work for the basic source character set, but should work for
>> anything
>> beyond that.)
>>
>>
>> It doesn't work, as there isn't any culture on earth that makes
>> that distinction, so there exists no
>> universal-character-name to make that distinction.
>> It is best left to people of letters to decide whether colors
>> carry meaning (and they sometimes do:
>> https://en.wikipedia.org/wiki/Ersu_Shaba_script)
> I believe Jens was just illustrating a hypothetical argument for
> the purpose of advancing the point that differently encoded source
> input should be preserved.
>
>
> Yes, and I was explaining why that was not necessary
I think your response focused too much on color; think of it as a
charmed A and a strange A if that helps. The point was that conversion
to source character set should not be lossy and we know of cases where
it is lossy today.
>
>>
>> If that means the term "character" or "abstract character" is
>> too loaded
>> to be used here, so be it. (The terminology space is already
>> fairly
>> crowded due to Unicode, so it's hard to find unused phrases
>> that give the
>> right connotation.)
>>
>>
>> The terminology used by Unicode people isn't Unicode specific. In
>> particular, "abstract character" is meaningful independently of
>> any computer system.
> I tend to agree, but some terms such as "code point" are defined
> in ISO/IEC 10646 as Unicode specific. We'll need to be careful
> about use of such terms that are reachable from our normative
> references.
>
>
> There is a ton of precedent for using code point and code unit for
> arbitrary encodings, even if the terms originate from Unicode

I agree, and I would like to use those terms. I'm not sure if we can
use "code point" though because of its definition in ISO/IEC 10646
unless we provide an alternate definition (which I don't know if we can
from an ISO perspective).

Tom.

>> In general, I'm still hoping that a compiler in an
>> EBCDIC-only world
>> can fit seamlessly in our future model.
>>
>>
>> Yes, that is definitely a primary goal :)
>> In general we shouldn't restrict the set of possible source
>> character sets or execution character sets more than they are
>> currently.
>
> +1.
>
> Tom.
>
>>
>> Jens

Received on 2020-06-14 14:20:11