sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 14 Jun 2020 20:48:07 +0200

On Sun, 14 Jun 2020 at 20:03, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Sun, Jun 14, 2020 at 5:03 AM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>> On 11/06/2020 00.06, Hubert Tong wrote:
>>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]
>>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>> >
>>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>>> > > I agree with Corentin's point that the strict use of abstract
>>> characters introduces problems where a coded character set contains
>>> multiple values for a single abstract character/contains characters that
>>> are canonically the same but assigned different values.
>>>
> To be clear, the statement I made above is an indication that I now
> believe that the notion of "coded character set" is necessary for making
> progress for our purposes.
>
>
>>> >
>>> > I have a hard time imagining such a thing. Can you give an
>>> example?
>>> >
>>> > Yes, U+FA9A as described in
>>> https://en.wikipedia.org/wiki/Han_unification has this situation with
>>> U+6F22.
>>> > These characters are distinct as members of a coded character set, but
>>> as abstract characters, I do not believe we can easily say the same.
>>>
>>> I would expect these to be two different abstract characters in the C++
>>> sense.
>>> Roughly, anything you can distinguish in the source should be a different
>>> "abstract character", if only for the benefit of raw string literals.
>>>
>>
>> "abstract character" is a notion for humans
>>
> +1
>
>
>> and to talk about mapping from one set to another.
>>
>> After phase 1, C++ deals with code points such that two sequences of code
>> points are identical if and only if they have the same values
>>
> This seems to be a source of getting hung up on terminology. I think this
> could help: The above sentence can be read as a tautology. A "code point"
> (within a coded character set) is synonymous with the value component of a
> coded character within that coded character set. Unfortunately, "value in
> the UCS codespace" is chosen as the "definition" for "code point" in
> ISO/IEC 10646.
>

It is a bit tautological, yes :)
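
A minimal sketch of the distinction, assuming Python 3 and its standard
unicodedata module, and using the U+FA9A / U+6F22 pair mentioned earlier in
the thread. The helper names are hypothetical and only for illustration:
comparing code point values (what C++ would do after phase 1) tells the two
apart, while a canonical-equivalence check (an approximation of "same
abstract character") does not.

```python
import unicodedata

compat = "\uFA9A"   # CJK COMPATIBILITY IDEOGRAPH-FA9A
unified = "\u6F22"  # the corresponding CJK unified ideograph

def same_code_points(a, b):
    # Hypothetical helper: value-by-value comparison of code points.
    return [ord(c) for c in a] == [ord(c) for c in b]

def canonically_equivalent(a, b):
    # Hypothetical helper: compare after canonical decomposition (NFD).
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(same_code_points(compat, unified))
print(canonically_equivalent(compat, unified))
```

The first check is the only notion of identity that needs no knowledge of
the Unicode character database.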

>
>
>>
>>> I don't think we should entertain any notion of "same character" in C++,
>>> beyond value comparisons in the execution encoding and "identity" as
>>> needed for "same identifier".
>>>
>>
>> We need to in/before phase 1, but I think we reached the consensus that
>> we otherwise
>> shouldn't and wouldn't
>>
> To be clear, we need to make sure we are on the same page with respect to
> the meta (notion of) notion of "same character":
> By "character", do we mean an "abstract character" or a "coded character"?
>

Abstract character in phase 1 (to get rid of "abstract character" in phase
1, we would have to assume that we already have encoded text; I think that
would be a reasonable assumption).

>
>
>>
>>
>>>
>>> For example, if some hypothetical input format differentiates red and
>>> green letters that are otherwise "the same", I'd still expect a red A
>>> to be a different abstract character than a green A. (Ok, that doesn't
>>> work for the basic source character set, but should work for anything
>>> beyond that.)
>>>
>>
>> It doesn't work, as there isn't any culture on earth that makes that
>> distinction, so there exists no universal-character-name to make that
>> distinction.
>> It is best left to people of letters to decide whether colors carry
>> meaning (and they sometimes do
>> https://en.wikipedia.org/wiki/Ersu_Shaba_script)
>>
>> If that means the term "character" or "abstract character" is too loaded
>>> to be used here, so be it. (The terminology space is already fairly
>>> crowded due to Unicode, so it's hard to find unused phrases that give the
>>> right connotation.)
>>>
>>
>> The terminology used by Unicode people isn't Unicode specific. In
>> particular, "abstract character" is meaningful independently of
>> any computer system.
>>
> I think that the relationships between terms represent an ideal that is
> not met in practice. "Abstract character" is a meaningful notion; however,
> the ideal that coded character sets are a bijective function between values
> in a codespace and abstract characters has not been clearly attained.
>

Coded character sets encode a set of abstract characters (Unicode also has
noncharacters).

Some abstract characters do not exist in any coded character set. There
are abstract characters not yet represented in computers, and those cannot be
handled by a C++ implementation.
Unicode encodes 140,000+ characters.
The important point is that all characters that can be represented in any
encoding supported by xlC, clang, msvc, gcc, edg, etc. on any system do have
a mapping in Unicode (in the case of EBCDIC, for the control characters,
which are hardly characters as people understand that word, the mapping is
more prescriptive than it is semantic, but a mapping does exist).
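
As a rough illustration of such a mapping, here is a sketch assuming Python 3
and its bundled "cp037" codec as a stand-in for one EBCDIC code page (the
choice of code page and the helper name are assumptions, not something any
particular compiler is claimed to use):

```python
def ebcdic_to_code_points(data, codec="cp037"):
    # Hypothetical helper: decode EBCDIC bytes and list the resulting
    # Unicode code points in U+XXXX notation.
    return ["U+%04X" % ord(ch) for ch in data.decode(codec)]

# A letter, a digit, and a control character (horizontal tab in EBCDIC).
for byte in (0xC1, 0xF0, 0x05):
    print(hex(byte), ebcdic_to_code_points(bytes([byte])))
```

Every byte gets a code point, including the control characters, even where
the assignment is more a matter of convention than of meaning.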

Received on 2020-06-14 13:51:27