sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 14 Jun 2020 22:19:06 +0200

On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> > I agree, but per other messages in this and other threads, experts
> > haven't fully defined mappings between character sets that fully preserves
> > semantics and we seem to be aware of implementations that are impacted.
> >
> > Either they have, they will or they won't, it hardly should fall under
> > the purview of the C++ committee :)
>
> If implementations of C++ are impacted by the choice of C++ to
> weave more of Unicode into its specification, I think that's very
> much under the purview of the C++ committee.
>
> > The raw literal magic reversion suggests to me that, post phase 1,
> > something more is needed than just basic source characters + UCNs or just
> > code points.
> >
> > I would like someone to give me 1 example of that :)
> > Also the raw literal magic reversion has nothing to do with any of it?
>
> Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.
>
> If I write Ä in the original source, I expect to get exactly
> that character in a raw string literal.
>
> If I write the (otherwise equivalent) \u00C4 in the original
> source, I expect to get the six (ASCII) characters \u00C4
> in a raw string literal.
>

I don't think anyone suggested that this should not happen.
We all seem to agree that the reversal is a hack, but it works.
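
To make the expectation above concrete, here is a minimal sketch (an
illustration only, not proposed wording) that checks it at compile time.
It assumes the source file is read as UTF-8, and it uses u8 literals so the
result does not depend on the execution character set. The helper name
code_units is made up for this example.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
// (Hypothetical helper, introduced only for this illustration.)
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// UCN spelled out inside a raw literal: the six basic source characters
// of the escape sequence survive unchanged.
constexpr std::size_t raw_ucn = code_units(u8R"(\u00C4)");

// The character written directly inside a raw literal (UTF-8 source
// assumed): two UTF-8 code units, 0xC3 0x84.
constexpr std::size_t raw_direct = code_units(u8R"(Ä)");

// Ordinary (non-raw) literal: the UCN is interpreted, giving the same two
// UTF-8 code units as the directly written character.
constexpr std::size_t plain_ucn = code_units(u8"\u00C4");

static_assert(raw_ucn == 6, "UCN stays verbatim in a raw literal");
static_assert(raw_direct == 2, "direct character is kept as itself");
static_assert(plain_ucn == 2, "UCN is interpreted outside raw literals");
```

The first two assertions are the two halves of the "reversal": the same
code point yields different raw literal contents depending on how it was
spelled in the original source.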

>
> However, as written, the specification says that Ä is turned
> into \u00C4 in phase 1. Unless hidden information is attached
> to \u00C4, the compiler doesn't know whether \u00C4 should be
> reversed to Ä in a raw string literal, or not.
>
> This is the "magic reversal" we're talking about: The issue is
> that the specification is silent about the hidden information.
> However, the fact that the hidden information must exist is a
> sign that either "just [Unicode] code points" or "just basic
> source character set plus UCNs" does not convey enough
> information.

Yes, I know, and I am not suggesting that the behavior of any
implementation should change in this regard.
It might be beneficial to convert UCN escape sequences that appear
verbatim in source files later in the translation process,
but I have not yet explored that idea enough to figure out whether it
would be sensible.

I don't think we would be introducing a new issue by changing the wording
or the design. These escape sequences
have to be tracked in the wording regardless, but maybe we are saying that
we want to improve or find a better solution to the magic reversal thing?
I will think about it :)

>
> >>> I don't think we should entertain any notion of "same character" in C++,
> >>> beyond value comparisons in the execution encoding and "identity" as
> >>> needed for "same identifier".
> >>>
> >>>
> >>> We need to in/before phase 1, but I think we reached the consensus that
> >>> we otherwise shouldn't and wouldn't
> >> I'm not sure we need to in phase 1 either. The only cases would be for
> >> conversion from source file characters that have multiple representations
> >> for the same semantic character, or (arguably) for Unicode normalization
> >> (which I believe we have consensus should not be performed in translation
> >> phase 1; in other words, EGCs are not "characters" for the purposes of
> >> translation phase 1).
> >>
> >>
> >> In phase 1 we need _something_
> >> Abstract character ( which is exactly what the standard calls "Physical
> >> Character" ) let us talk about the picture of the code case.
> > In phase 1, we need the concept of identity in order to map the source
> > input to the basic source character set + UCNs. I think Jens was arguing
> > more that we do not need (and should not need) the concept of equivalence.
> >
> >
> > Sure (as long as we accept that 1 abstract character may map to a
> > sequence of code points (or UCNs))
>
> No, each code point in a sequence (given Unicode input) is a separate
> abstract character in my view (after combining surrogate pairs, of course).
>

For example, diacritics, when preceded by a letter, are not considered
abstract characters of their own.
That also includes many emoji, many Hangul characters, and probably
tons of other scripts.
(But again, after phase 1 this distinction does not matter.)
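
As a rough illustration of the distinction (a sketch only; the helper
names units, same_values and combine_surrogates are made up here), the
precomposed and decomposed spellings of the same abstract character are
different code point sequences, and a plain value comparison tells them
apart because no normalization is applied. The last lines show the
surrogate pair combination Jens mentions.

```cpp
#include <cstddef>

// Number of code units in a literal, excluding the terminating NUL.
// For char32_t literals this is also the number of code points.
template <typename CharT, std::size_t N>
constexpr std::size_t units(const CharT (&)[N]) { return N - 1; }

// Plain element-wise value comparison, with no notion of equivalence.
template <std::size_t N, std::size_t M>
constexpr bool same_values(const char32_t (&a)[N], const char32_t (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// Combine a UTF-16 surrogate pair into the code point it encodes.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return static_cast<char32_t>(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

// U+00C4 alone versus U+0041 followed by U+0308 COMBINING DIAERESIS.
static_assert(units(U"\u00C4") == 1, "precomposed: one code point");
static_assert(units(U"A\u0308") == 2, "decomposed: two code points");
static_assert(!same_values(U"\u00C4", U"A\u0308"), "no equivalence applied");

// U+1F600 is one code point but two UTF-16 code units.
static_assert(units(u"\U0001F600") == 2, "surrogate pair in UTF-16");
static_assert(units(U"\U0001F600") == 1, "single code point in UTF-32");
static_assert(combine_surrogates(0xD83D, 0xDE00) == U'\U0001F600', "pair combines");
```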

>
> >>> For example, if some hypothetical input format differentiates red and
> >>> green letters that are otherwise "the same", I'd still expect a red A
> >>> to be a different abstract character than a green A. (Ok, that doesn't
> >>> work for the basic source character set, but should work for anything
> >>> beyond that.)
> >>>
> >>>
> >>> It doesn't work as there isn't any culture on earth that make that
> >>> distinction such that there exist no universal-character-name to make
> >>> that distinction.
> >>> It is best left to people of letter to decide whether colors carry
> >>> meaning (and they sometimes do
> >>> https://en.wikipedia.org/wiki/Ersu_Shaba_script)
> >> I believe Jens was just illustrating a hypothetical argument for the
> >> purpose of advancing the point that differently encoded source input
> >> should be preserved.
> >>
> >>
> >> Yes, and I was explaining why that was not necessary
> > I think your response focused too much on color; think of it as a charmed
> > A and a strange A if that helps. The point was that conversion to source
> > character set should not be lossy and we know of cases where it is lossy
> > today.
> >
> >
> > Again I would like to see an example of that
>
> I thought the EBCDIC control characters are an example of a lossy
> conversion. Unicode can define that problem as "out of scope" for them,
> but that doesn't mean it goes away from a C++ perspective.
>

Unicode defines the control characters as "application specific", which I
should have realized sooner (see one of my other answers in the thread).

>
> >>> If that means the term "character" or "abstract character" is too loaded
> >>> to be used here, so be it. (The terminology space is already fairly
> >>> crowded due to Unicode, so it's hard to find unused phrases that give the
> >>> right connotation.)
> >>>
> >>>
> >>> The terminology used by Unicode people isn't Unicode specific. In
> >>> particular, "abstract character" is meaningful independently of
> >>> any computer system.
> >> I tend to agree, but some terms such as "code point" are defined in
> >> ISO/IEC 10646 as Unicode specific. We'll need to be careful about use of
> >> such terms that are reachable from our normative references.
> >>
> >>
> >> There is a ton of precedence for using code point and code unit for
> >> arbitrary encoding - even if the terms originate from Unicode
> >
> > I agree, and I would like to use those terms. I'm not sure if we can use
> > "code point" though because of its definition in ISO/IEC 10646 unless we
> > provide an alternate definition (which I don't know if we can from an ISO
> > perspective).
> >
> > I think we agreed that we should define the terms regardless of their
> > existence in ISO/IEC 10646 ? Are there some iso constraints on that?
>
> We're normatively referring to ISO 10646, so I think it would be actively
> bad if we were to redefine Unicode terms to mean something else in the
> general context of text and characters. We should seek input from the
> project editor.
>
> Jens
>

Received on 2020-06-14 15:22:27