sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 14 Jun 2020 14:03:28 -0400

On Sun, Jun 14, 2020 at 5:03 AM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <sg16_at_[hidden]>
> wrote:
>
>> On 11/06/2020 00.06, Hubert Tong wrote:
>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]
>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>> >
>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>> > > I agree with Corentin's point that the strict use of abstract
>> characters introduces problems where a coded character set contains
>> multiple values for a single abstract character/contains characters that
>> are canonically the same but assigned different values.
>>
To be clear, the statement I made above is an indication that I now
believe that the notion of "coded character set" is necessary for making
progress for our purposes.

> >
>> > I have a hard time imagining such a thing. Can you give an example?
>> >
>> > Yes, U+FA9A as described in
>> https://en.wikipedia.org/wiki/Han_unification has this situation with
>> U+6F22.
>> > These characters are distinct as members of a coded character set, but
>> as abstract characters, I do not believe we can easily say the same.
>>
>> I would expect these to be two different abstract characters in the C++
>> sense.
>> Roughly, anything you can distinguish in the source should be a different
>> "abstract character", if only for the benefit of raw string literals.
>>
>
> "abstract character" is a notion for humans
>
+1

> and to talk about mapping from one set to another.
>
> After phase 1, C++ deals with code points such that two sequences of code
> points are identical if and only if they have the same values
>
This seems to be where we are getting hung up on terminology. I think the
following could help: the above sentence can be read as a tautology. A
"code point" (within a coded character set) is synonymous with the value
component of a coded character within that coded character set.
Unfortunately, "value in the UCS codespace" is chosen as the "definition"
for "code point" in ISO/IEC 10646.
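
To make the value-comparison reading concrete, here is a minimal sketch (not
proposed wording, and assuming C++17): the helper name same_code_points and
the two constants are hypothetical, chosen only for illustration. It uses the
U+6F22 / U+FA9A pair mentioned earlier in the thread.

```cpp
#include <string_view>

// Hypothetical helper: two code point sequences are "identical" exactly
// when they have the same length and the same values, position by position.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

// U+6F22 and U+FA9A (a CJK compatibility ideograph whose canonical
// decomposition is U+6F22), spelled with universal-character-names.
constexpr std::u32string_view han_unified = U"\u6F22";
constexpr std::u32string_view han_compat  = U"\uFA9A";

// Same values: identical sequences.
static_assert(same_code_points(han_unified, U"\u6F22"));
// Different values: distinct sequences, whatever one believes about the
// abstract characters behind them.
static_assert(!same_code_points(han_unified, han_compat));
```

Nothing in the comparison consults normalization or any notion of abstract
character; only the values in the coded character set take part.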

>
> I don't think we should entertain any notion of "same character" in C++,
>> beyond value comparisons in the execution encoding and "identity" as
>> needed for "same identifier".
>>
>
> We need to in/before phase 1, but I think we reached the consensus that we
> otherwise
> shouldn't and wouldn't
>
To be clear, we need to make sure we are on the same page with respect to
the meta-level question of what we mean by "same character":
By "character", do we mean an "abstract character" or a "coded character"?

>
>
>>
>> For example, if some hypothetical input format differentiates red and
>> green letters that are otherwise "the same", I'd still expect a red A
>> to be a different abstract character than a green A. (Ok, that doesn't
>> work for the basic source character set, but should work for anything
>> beyond that.)
>>
>
> It doesn't work as there isn't any culture on earth that makes that
> distinction such that there exists no universal-character-name to make that
> distinction.
> It is best left to people of letters to decide whether colors carry meaning
> (and they sometimes do https://en.wikipedia.org/wiki/Ersu_Shaba_script)
>
> If that means the term "character" or "abstract character" is too loaded
>> to be used here, so be it. (The terminology space is already fairly
>> crowded due to Unicode, so it's hard to find unused phrases that give the
>> right connotation.)
>>
>
> The terminology used by Unicode people isn't Unicode specific. In
> particular, "abstract character" is meaningful independently of
> any computer system.
>
I think that the relationships between terms represent an ideal that is not
met in practice. "Abstract character" is a meaningful notion; however, the
ideal that each coded character set is a bijective function between values
in a codespace and abstract characters has not been clearly attained.
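
As a toy illustration of what failing to be a bijection looks like, here is a
rough sketch (C++17). Everything in it is hypothetical: the CodedCharacter
struct, the abstract_id labels, and the is_injective helper are made up for
this message. In particular, it assumes for the sake of argument that U+6F22
and U+FA9A name the same abstract character, which is exactly the point that
is hard to settle.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of a coded character: a value in the codespace paired
// with an arbitrary label standing in for an abstract character.
struct CodedCharacter {
    char32_t value;
    int abstract_id;  // assumption: equal ids mean "same abstract character"
};

// Returns false if two different values carry the same abstract character,
// i.e. the value -> abstract character mapping is not injective.
template <std::size_t N>
constexpr bool is_injective(const std::array<CodedCharacter, N>& set) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = i + 1; j < N; ++j)
            if (set[i].abstract_id == set[j].abstract_id &&
                set[i].value != set[j].value)
                return false;
    return true;
}

// Toy set under the assumption stated above: ids 0 and 0 for the Han pair.
constexpr std::array<CodedCharacter, 3> toy_set{{
    {U'\u6F22', 0}, {U'\uFA9A', 0}, {U'\u00C5', 1}}};

// The same toy set without the compatibility ideograph.
constexpr std::array<CodedCharacter, 2> toy_set_without_compat{{
    {U'\u6F22', 0}, {U'\u00C5', 1}}};

static_assert(!is_injective(toy_set));
static_assert(is_injective(toy_set_without_compat));
```

Whether the two entries really deserve the same label is the open question;
the sketch only shows that, if they do, the mapping is not one-to-one.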

Received on 2020-06-14 13:06:55