Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-14 12:21:17
On 6/14/20 5:03 AM, Corentin Jabot via SG16 wrote:
> On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16
> <sg16_at_[hidden]> wrote:
> On 11/06/2020 00.06, Hubert Tong wrote:
> > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >    On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
> >    > I agree with Corentin's point that the strict use of
> abstract characters introduces problems where a coded character
> set contains multiple values for a single abstract
> character/contains characters that are canonically the same but
> assigned different values.
> >    I have a hard time imagining such a thing.  Can you give an
> > Yes, U+FA9A as described in
> https://en.wikipedia.org/wiki/Han_unification has this situation
> with U+6F22.
> > These characters are distinct as members of a coded character
> set, but as abstract characters, I do not believe we can easily
> say the same.
> I would expect these to be two different abstract characters in
> the C++ sense.
> Roughly, anything you can distinguish in the source should be an
> "abstract character", if only for the benefit of raw string literals.
> "abstract character" is a notion for humans and to talk about mapping
> from one set to another.
> After phase 1, C++ deals with code points such that two sequences of
> code points are identical if and only if they have the same values
I think Jens' point stands; two different abstract characters in the
source should be differentiated post translation phase 1.
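
A minimal sketch of that point, assuming a C++17 compiler. U+FA9A canonically decomposes to U+6F22 in Unicode, yet the two remain distinct values once they reach a literal. The helper name same_code_points is hypothetical and only for illustration.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: element-wise comparison of code point values,
// with no normalization and no notion of "same abstract character".
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

// U+FA9A canonically decomposes to U+6F22, but the values differ.
static_assert(U'\uFA9A' != U'\u6F22');
static_assert(!same_code_points(U"\uFA9A", U"\u6F22"));
static_assert(same_code_points(U"\u6F22", U"\u6F22"));
```

Nothing here depends on a judgment about whether the two are "the same character"; only the values are compared.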
Post translation phase 1, all we have are basic source characters and
UCNs (with magical revert of UCNs for raw string literals); not code points.
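
As a rough illustration of how raw string literals differ here (a sketch assuming C++17; the variable names cooked and raw are hypothetical): a UCN spelled out in an ordinary literal is translated to one code point, while the same spelling inside a raw literal is kept as the characters that were written. The "revert" mentioned above concerns extended characters typed directly in a raw literal, which this example does not exercise.

```cpp
#include <cassert>
#include <string_view>

// Ordinary literal: the UCN is translated to the single code point U+6F22.
constexpr std::u32string_view cooked = U"\u6F22";

// Raw literal: the six source characters \ u 6 F 2 2 are kept as written.
constexpr std::u32string_view raw = UR"(\u6F22)";

static_assert(cooked.size() == 1);
static_assert(raw.size() == 6);
static_assert(raw[0] == U'\\' && raw[1] == U'u');
```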
> I don't think we should entertain any notion of "same character"
> in C++,
> beyond value comparisons in the execution encoding and "identity" as
> needed for "same identifier".
> We need to in/before phase 1, but I think we reached the consensus
> that we otherwise
> shouldn't and wouldn't
I'm not sure we need to in phase 1 either.  The only cases would be for
conversion from source file characters that have multiple
representations for the same semantic character, or (arguably) for
Unicode normalization (which I believe we have consensus should not be
performed in translation phase 1; in other words, EGCs are not
"characters" for the purposes of translation phase 1).
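
To make the normalization case concrete, here is a small sketch (assuming C++17; the names precomposed and decomposed are hypothetical). The two spellings of "é" below are canonically equivalent under Unicode normalization, but when written with UCNs they denote different code point sequences and stay different.

```cpp
#include <cassert>
#include <string_view>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, one code point.
constexpr std::u32string_view precomposed = U"\u00E9";

// U+0065 followed by U+0301 COMBINING ACUTE ACCENT, two code points.
constexpr std::u32string_view decomposed = U"e\u0301";

// No normalization is applied, so the sequences differ in length and value.
static_assert(precomposed.size() == 1);
static_assert(decomposed.size() == 2);
static_assert(precomposed != decomposed);
```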
> For example, if some hypothetical input format differentiates red and
> green letters that are otherwise "the same", I'd still expect a red A
> to be a different abstract character than a green A.Â (Ok, that
> work for the basic source character set, but should work for anything
> beyond that.)
> It doesn't work, as there isn't any culture on earth that makes that
> distinction, so there exists no universal-character-name to make
> that distinction.
> It is best left to people of letters to decide whether colors carry
> meaning (and they sometimes do
I believe Jens was just illustrating a hypothetical argument for the
purpose of advancing the point that differently encoded source input
should be preserved.
> If that means the term "character" or "abstract character" is too
> to be used here, so be it.  (The terminology space is already fairly
> crowded due to Unicode, so it's hard to find unused phrases that
> give the
> right connotation.)
> The terminology used by Unicode people isn't Unicode specific. In
> particular, "abstract character" is meaningful independently of
> any computer system.
I tend to agree, but some terms such as "code point" are defined in
ISO/IEC 10646 as Unicode specific.  We'll need to be careful about use
of such terms that are reachable from our normative references.
> In general, I'm still hoping that a compiler in an EBCDIC-only world
> can fit seamlessly in our future model.
> Yes, that is definitively a primary goal :)
> In general we shouldn't restrict the set of possible source character
> sets or execution character sets more than they are currently.