Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 14 Jun 2020 19:39:04 +0200
On Sun, 14 Jun 2020 at 19:21, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/14/20 5:03 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <sg16_at_[hidden]>
> wrote:
>
>> On 11/06/2020 00.06, Hubert Tong wrote:
>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>> wrote:
>> >
>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>> > > I agree with Corentin's point that the strict use of abstract
>> characters introduces problems where a coded character set contains
>> multiple values for a single abstract character/contains characters that
>> are canonically the same but assigned different values.
>> >
>> > I have a hard time imagining such a thing. Can you give an example?
>> >
>> > Yes, U+FA9A as described in
>> https://en.wikipedia.org/wiki/Han_unification has this situation with
>> U+6F22.
>> > These characters are distinct as members of a coded character set, but
>> as abstract characters, I do not believe we can easily say the same.
>>
>> I would expect these to be two different abstract characters in the C++
>> sense.
>> Roughly, anything you can distinguish in the source should be a different
>> "abstract character", if only for the benefit of raw string literals.
>>
>
> "abstract character" is a notion for humans and to talk about mapping from
> one set to another.
> After phase 1, C++ deals with code points such that two sequences of code
> points are identical if and only if they have the same values
>
> I think Jens' point stands; two different abstract characters in the
> source should be differentiated post translation phase 1.
>
Yes? But letters with different colors are not different abstract
characters. Let's let experts decide what constitutes an abstract character.


> Post translation phase 1, all we have are basic source characters and UCNs
> (with magical revert of UCNs for raw string literals); not code points.
>
UCNs _are_ code points (a sequence of UCNs carries exactly as much
information as a sequence of code points).
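
To make that concrete, here is a minimal sketch using the U+6F22 / U+FA9A
pair from earlier in the thread. The helper names (same_code_points,
han_unified, han_compatibility) are made up for illustration and come from
neither the standard nor any proposal. Each UCN in a char32_t literal yields
exactly one code point, and two sequences compare equal only when their
values match, whatever Unicode says about canonical equivalence.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption for illustration): two code point
// sequences are identical if and only if they hold the same values in
// the same order. No normalization is applied.
inline bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

// U+6F22 and the compatibility ideograph U+FA9A, each spelled as a UCN.
inline std::u32string han_unified()       { return U"\u6F22"; }
inline std::u32string han_compatibility() { return U"\uFA9A"; }
```

A caller would see same_code_points(han_unified(), han_compatibility())
return false, even though the two are described above as canonically the
same.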



>
> I don't think we should entertain any notion of "same character" in C++,
>> beyond value comparisons in the execution encoding and "identity" as
>> needed for "same identifier".
>>
>
> We need to in/before phase 1, but I think we reached the consensus that we
> otherwise
> shouldn't and wouldn't
>
> I'm not sure we need to in phase 1 either. The only cases would be for
> conversion from source file characters that have multiple representations
> for the same semantic character, or (arguably) for Unicode normalization
> (which I believe we have consensus should not be performed in translation
> phase 1; in other words, EGCs are not "characters" for the purposes of
> translation phase 1).
>

In phase 1 we need _something_.
Abstract character (which is exactly what the standard calls "Physical
Character") lets us talk about the picture of the code case.

>
>
>>
>> For example, if some hypothetical input format differentiates red and
>> green letters that are otherwise "the same", I'd still expect a red A
>> to be a different abstract character than a green A. (Ok, that doesn't
>> work for the basic source character set, but should work for anything
>> beyond that.)
>>
>
> It doesn't work, as there isn't any culture on earth that makes that
> distinction, such that there exists no universal-character-name to make
> that distinction.
> It is best left to people of letters to decide whether colors carry meaning
> (and they sometimes do https://en.wikipedia.org/wiki/Ersu_Shaba_script)
>
> I believe Jens was just illustrating a hypothetical argument for the
> purpose of advancing the point that differently encoded source input should
> be preserved.
>

Yes, and I was explaining why that was not necessary.

>
> If that means the term "character" or "abstract character" is too loaded
>> to be used here, so be it. (The terminology space is already fairly
>> crowded due to Unicode, so it's hard to find unused phrases that give the
>> right connotation.)
>>
>
> The terminology used by Unicode people isn't Unicode specific. In
> particular, "abstract character" is meaningful independently of
> any computer system.
>
> I tend to agree, but some terms such as "code point" are defined in
> ISO/IEC 10646 as Unicode specific. We'll need to be careful about use of
> such terms that are reachable from our normative references.
>

There is a ton of precedent for using code point and code unit for
arbitrary encodings, even if the terms originate from Unicode.

>
>
>> In general, I'm still hoping that a compiler in an EBCDIC-only world
>> can fit seamlessly in our future model.
>>
>
> Yes, that is definitely a primary goal :)
> In general we shouldn't restrict the set of possible source character sets
> or execution character sets more than they are currently.
>
> +1.
>
> Tom.
>
>
>
>>
>> Jens
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
>
>

Received on 2020-06-14 12:42:24