Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-14 13:48:07
On Sun, 14 Jun 2020 at 20:03, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> On Sun, Jun 14, 2020 at 5:03 AM Corentin Jabot <corentinjabot_at_[hidden]>
>> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <sg16_at_[hidden]>
>>> On 11/06/2020 00.06, Hubert Tong wrote:
>>> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>> > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>>> > > I agree with Corentin's point that the strict use of abstract
>>> characters introduces problems where a coded character set contains
>>> multiple values for a single abstract character/contains characters that
>>> are canonically the same but assigned different values.
> To be clear, the statement I made above is an indication that I now
> believe that the notion of "coded character set" is necessary for making
> progress for our purposes.
>>> > I have a hard time imagining such a thing. Can you give an
>>> > Yes, U+FA9A as described in
>>> https://en.wikipedia.org/wiki/Han_unification has this situation with
>>> > These characters are distinct as members of a coded character set, but
>>> as abstract characters, I do not believe we can easily say the same.
>>> I would expect these to be two different abstract characters in the C++
>>> Roughly, anything you can distinguish in the source should be a different
>>> "abstract character", if only for the benefit of raw string literals.
>> "abstract character" is a notion for humans
>> and to talk about mapping from one set to another.
>> After phase 1, C++ deals with code points such that two sequences of code
>> points are identical if and only if they have the same values
> This seems to be a source of getting hung up on terminology. I think this
> could help: The above sentence can be read as a tautology. A "code point"
> (within a coded character set) is synonymous with the value component of a
> coded character within that coded character set. Unfortunately, "value in
> the UCS codespace" is chosen as the "definition" for "code point" in
> ISO/IEC 10646.
It is a bit tautological, yes :)
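To make the value-based identity concrete, here is a minimal sketch (the function and variable names are made up for illustration and do not come from any paper). It assumes C++17. U+FA9A is canonically equivalent to U+6F22 in Unicode, yet the two compare unequal as code point sequences, because only the values are compared:

```cpp
#include <string_view>

// Two code point sequences are identical if and only if they have the
// same values. No normalization or canonical equivalence is applied.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

// U+FA9A is a CJK compatibility ideograph; U+6F22 is the unified ideograph
// it is canonically equivalent to.
constexpr std::u32string_view compat  = U"\uFA9A";
constexpr std::u32string_view unified = U"\u6F22";

static_assert(same_code_points(unified, U"\u6F22"));
static_assert(!same_code_points(compat, unified));
```

Whether the two are the "same abstract character" is exactly the question this comparison never has to answer.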
>>> I don't think we should entertain any notion of "same character" in C++,
>>> beyond value comparisons in the execution encoding and "identity" as
>>> needed for "same identifier".
>> We need to in/before phase 1, but I think we reached the consensus that
>> we otherwise shouldn't and wouldn't.
> To be clear, we need to make sure we are on the same page with respect to
> the meta (notion of) notion of "same character":
> By "character", do we mean an "abstract character" or a "coded character"?
Abstract character in phase 1. (To get rid of "abstract character" in phase
1, we would have to assume that we have encoded text already; I think that
would be a reasonable assumption.)
>>> For example, if some hypothetical input format differentiates red and
>>> green letters that are otherwise "the same", I'd still expect a red A
>>> to be a different abstract character than a green A. (Ok, that doesn't
>>> work for the basic source character set, but should work for anything
>>> beyond that.)
>> It doesn't work, as there isn't any culture on earth that makes that
>> distinction, such that there exists no universal-character-name to make that
>> It is best left to people of letters to decide whether colors carry
>> meaning (and they sometimes do
>>> If that means the term "character" or "abstract character" is too loaded
>>> to be used here, so be it. (The terminology space is already fairly
>>> crowded due to Unicode, so it's hard to find unused phrases that give the
>>> right connotation.)
>> The terminology used by Unicode people isn't Unicode specific. In
>> particular, "abstract character" is meaningful independently of
>> any computer system.
> I think that the relationships between terms represent an ideal that is
> not met in practice. "Abstract character" is a meaningful notion; however,
> the ideal that coded character sets are a bijective function between values
> in a codespace and abstract characters has not been clearly attained.
Coded character sets encode a set of abstract characters (Unicode has
Some abstract characters do not exist in any coded character set. There
are abstract characters not yet represented in computers that cannot be
handled by a C++ implementation.
Unicode encodes 140 000+ characters.
The important point is that all characters that can be represented in any
encoding supported by xlC, clang, msvc, gcc, edg, etc. on any system do have
a mapping in Unicode. (In the case of EBCDIC, for the control characters,
which are hardly characters as people understand that word, the mapping is
more prescriptive than it is semantic, but a mapping does exist.)
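As a rough illustration of such a mapping, here is a sketch covering only a handful of bytes from EBCDIC code page 037, assuming C++17. The helper name cp037_to_unicode is hypothetical, and a real table would cover all 256 byte values. The NL entry shows the "prescriptive" kind of mapping: EBCDIC NL (0x15) is conventionally assigned to U+0085.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: maps a few EBCDIC (code page 037) bytes to Unicode
// code points. Bytes outside this small sample yield std::nullopt, which
// only means "not covered by this sketch", not "unmappable".
constexpr std::optional<char32_t> cp037_to_unicode(std::uint8_t byte) {
    switch (byte) {
        case 0x40: return U' ';              // SPACE
        case 0x81: return U'a';
        case 0xC1: return U'A';
        case 0xF0: return U'0';
        case 0x15: return char32_t{0x0085};  // NL, mapped by convention to U+0085
        default:   return std::nullopt;
    }
}

static_assert(*cp037_to_unicode(0xC1) == U'A');
static_assert(*cp037_to_unicode(0x15) == char32_t{0x0085});
```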