sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 14 Jun 2020 21:33:24 +0200

On Sun, 14 Jun 2020 at 21:16, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/14/20 1:39 PM, Corentin Jabot wrote:
>
>
>
> On Sun, 14 Jun 2020 at 19:21, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/14/20 5:03 AM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>> On 11/06/2020 00.06, Hubert Tong wrote:
>>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]
>>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>> >
>>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>>> > > I agree with Corentin's point that the strict use of abstract
>>> characters introduces problems where a coded character set contains
>>> multiple values for a single abstract character/contains characters that
>>> are canonically the same but assigned different values.
>>> >
>>> > I have a hard time imagining such a thing. Can you give an
>>> example?
>>> >
>>> > Yes, U+FA9A as described in
>>> https://en.wikipedia.org/wiki/Han_unification has this situation with
>>> U+6F22.
>>> > These characters are distinct as members of a coded character set, but
>>> as abstract characters, I do not believe we can easily say the same.
>>>
>>> I would expect these to be two different abstract characters in the C++
>>> sense.
>>> Roughly, anything you can distinguish in the source should be a different
>>> "abstract character", if only for the benefit of raw string literals.
>>>
>>
>> "abstract character" is a notion for humans and to talk about mapping
>> from one set to another.
>> After phase 1, C++ deals with code points such that two sequences of code
>> points are identical if and only if they have the same values
>>
>> I think Jens' point stands; two different abstract characters in the
>> source should be differentiated post translation phase 1.
>>
> Yes? But letters with different colors are not different abstract
> characters. Let's let experts decide what constitutes an abstract character.
>
> I agree, but per other messages in this and other threads, experts haven't
> fully defined mappings between character sets that fully preserve
> semantics, and we seem to be aware of implementations that are impacted.
>
Either they have, they will, or they won't; in any case, it should hardly
fall under the purview of the C++ committee :)

>
>
>> Post translation phase 1, all we have are basic source characters and
>> UCNs (with magical revert of UCNs for raw string literals); not code points.
>>
> UCNs _are_ code points (a sequence of UCNs carries exactly as much
> information as a sequence of code points)
>
> A UCN is, for example, the character sequence '\', 'u', <hex-digit>,
> <hex-digit>, <hex-digit>, <hex-digit>. The fact that it has a trivial
> mapping to a code point doesn't make it a code point. I think we've
> discussed the distinction in several contexts now.
>
There is a strict bijection between UCNs and code points.
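
To make the claimed correspondence concrete, here is a minimal sketch in Python. The helper names `to_ucn` and `from_ucn` are hypothetical, and the sketch assumes one canonical spelling per code point (`\uXXXX` up to U+FFFF, `\UXXXXXXXX` above); without that assumption, `\u00E9` and `\U000000E9` would both name the same code point and the mapping would only be onto, not one-to-one.

```python
import string

def to_ucn(cp: int) -> str:
    """Spell a code point as a universal-character-name (assumed canonical form)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"

def from_ucn(ucn: str) -> int:
    """Recover the code point named by a \\uXXXX or \\UXXXXXXXX spelling."""
    digits = ucn[2:]
    well_formed = (
        (ucn.startswith("\\u") and len(digits) == 4)
        or (ucn.startswith("\\U") and len(digits) == 8)
    ) and all(d in string.hexdigits for d in digits)
    if not well_formed:
        raise ValueError("not a universal-character-name")
    return int(digits, 16)
```

Round-tripping any scalar value through `to_ucn` and then `from_ucn` returns the original value, which is the sense in which a sequence of UCNs carries the same information as a sequence of code points.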

> The raw literal magic reversion suggests to me that, post phase 1,
> something more is needed than just basic source characters + UCNs or just
> code points.
>
I would like someone to give me one example of that :)
Also, the raw literal magic reversion has nothing to do with any of this?

>
>
>
>>
>>> I don't think we should entertain any notion of "same character" in C++,
>>> beyond value comparisons in the execution encoding and "identity" as
>>> needed for "same identifier".
>>>
>>
>> We need to in/before phase 1, but I think we reached the consensus that
>> we otherwise
>> shouldn't and wouldn't
>>
>> I'm not sure we need to in phase 1 either. The only cases would be for
>> conversion from source file characters that have multiple representations
>> for the same semantic character, or (arguably) for Unicode normalization
>> (which I believe we have consensus should not be performed in translation
>> phase 1; in other words, EGCs are not "characters" for the purposes of
>> translation phase 1).
>>
>
> In phase 1 we need _something_.
> Abstract character (which is exactly what the standard calls "Physical
> Character") lets us talk about the picture of the code case.
>
> In phase 1, we need the concept of identity in order to map the source
> input to the basic source character set + UCNs. I think Jens was arguing
> more that we do not need (and should not need) the concept of equivalence.
>

Sure, as long as we accept that one abstract character may map to a sequence
of code points (or UCNs).
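
As a small illustration of one abstract character corresponding to more than one code point, the sketch below uses Python's standard `unicodedata` module. The helper `code_points` is a hypothetical name, and the choice of "é" is only an example; nothing here is meant to describe what a C++ implementation does in phase 1.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

# "é" as a single code point, and the same abstract character
# spelled as a base letter followed by a combining acute accent.
precomposed = "\u00E9"
decomposed = unicodedata.normalize("NFD", precomposed)
```

Here `code_points(precomposed)` has one entry and `code_points(decomposed)` has two, yet both strings denote the same abstract character.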

>>> For example, if some hypothetical input format differentiates red and
>>> green letters that are otherwise "the same", I'd still expect a red A
>>> to be a different abstract character than a green A. (Ok, that doesn't
>>> work for the basic source character set, but should work for anything
>>> beyond that.)
>>>
>>
>> It doesn't work, as there isn't any culture on earth that makes that
>> distinction, so there exists no universal-character-name to make that
>> distinction.
>> It is best left to people of letters to decide whether colors carry
>> meaning (and they sometimes do:
>> https://en.wikipedia.org/wiki/Ersu_Shaba_script).
>>
>> I believe Jens was just illustrating a hypothetical argument for the
>> purpose of advancing the point that differently encoded source input should
>> be preserved.
>>
>
> Yes, and I was explaining why that was not necessary
>
> I think your response focused too much on color; think of it as a charmed
> A and a strange A if that helps. The point was that conversion to source
> character set should not be lossy and we know of cases where it is lossy
> today.

Again, I would like to see an example of that (and again, the point of a
mapping to a Unicode code point or UCN is that a mapping exists, not that a
mapping is executed).
I am arguing for a more precise mapping, not a less precise one :)
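
For reference, the U+FA9A / U+6F22 pair mentioned earlier in the thread can be used to show the difference between comparing by code point value and comparing after normalization. This is only a sketch in Python using the standard `unicodedata` module; the function names are hypothetical, and the normalizing comparison stands in for a hypothetical conversion step that would lose the distinction, not for any known compiler behavior.

```python
import unicodedata

compat = "\uFA9A"   # CJK COMPATIBILITY IDEOGRAPH-FA9A
unified = "\u6F22"  # CJK UNIFIED IDEOGRAPH-6F22

def same_by_value(a: str, b: str) -> bool:
    """Identity in the 'same code point values' sense."""
    return [ord(c) for c in a] == [ord(c) for c in b]

def same_after_nfc(a: str, b: str) -> bool:
    """Equivalence after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings differ under `same_by_value` but are equal under `same_after_nfc`, so a mapping that normalized its input could not be reversed to recover which of the two was written in the source.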

>
>> If that means the term "character" or "abstract character" is too loaded
>>> to be used here, so be it. (The terminology space is already fairly
>>> crowded due to Unicode, so it's hard to find unused phrases that give the
>>> right connotation.)
>>>
>>
>> The terminology used by Unicode people isn't Unicode specific. In
>> particular, "abstract character" is meaningful independently of
>> any computer system.
>>
>> I tend to agree, but some terms such as "code point" are defined in
>> ISO/IEC 10646 as Unicode specific. We'll need to be careful about use of
>> such terms that are reachable from our normative references.
>>
>
> There is a ton of precedent for using code point and code unit for
> arbitrary encodings, even if the terms originate from Unicode.
>
> I agree, and I would like to use those terms. I'm not sure if we can use
> "code point" though because of its definition in ISO/IEC 10646 unless we
> provide an alternate definition (which I don't know if we can from an ISO
> perspective).
>
I think we agreed that we should define the terms regardless of their
existence in ISO/IEC 10646? Are there some ISO constraints on that?

> Tom.
>
>
>>
>>> In general, I'm still hoping that a compiler in an EBCDIC-only world
>>> can fit seamlessly in our future model.
>>>
>>
>> Yes, that is definitely a primary goal :)
>> In general we shouldn't restrict the set of possible source character
>> sets or execution character sets more than they are currently.
>>
>> +1.
>>
>> Tom.
>>
>>
>>
>>>
>>> Jens
>>>
>>
>>
>>
>

Received on 2020-06-14 14:36:46