sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sun, 14 Jun 2020 21:55:44 +0200

On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> I agree, but per other messages in this and other threads, experts haven't fully defined mappings between character sets that fully preserve semantics and we seem to be aware of implementations that are impacted.
>
> Either they have, they will, or they won't; it should hardly fall under the purview of the C++ committee :)

If implementations of C++ are impacted by the choice of C++ to
weave more of Unicode into its specification, I think that's very
much under the purview of the C++ committee.

> The raw literal magic reversion suggests to me that, post phase 1, something more is needed than just basic source characters + UCNs or just code points.
>
> I would like someone to give me 1 example of that :)
> Also the raw literal magic reversion has nothing to do with any of it?

Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.

If I write Ä in the original source, I expect to get exactly
that character in a raw string literal.

If I write the (otherwise equivalent) \u00C4 in the original
source, I expect to get the six (ASCII) characters \u00C4
in a raw string literal.

However, as written, the specification says that Ä is turned
into \u00C4 in phase 1. Unless hidden information is attached
to \u00C4, the compiler doesn't know whether \u00C4 should be
reversed to Ä in a raw string literal, or not.

This is the "magic reversal" we're talking about: The issue is
that the specification is silent about the hidden information.
However, the fact that the hidden information must exist is a
sign that either "just [Unicode] code points" or "just basic
source character set plus UCNs" does not convey enough
information.
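
To make the loss of information concrete, here is a minimal sketch of a
phase 1 that does nothing but replace extended characters with UCNs. The
function name phase1_to_ucn is hypothetical, the input is assumed to be
well-formed UTF-8, and this is a toy model of the wording under discussion,
not how any actual compiler works.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical model of translation phase 1: every character outside
// ASCII is replaced by a universal-character-name. Assumes well-formed
// UTF-8 input; no validation is attempted.
std::string phase1_to_ucn(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {            // ASCII passes through unchanged
            out += static_cast<char>(lead);
            ++i;
            continue;
        }
        // Sequence length from the lead byte (2, 3 or 4 bytes).
        const int len = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : 2;
        std::uint32_t cp = lead & (0xFF >> (len + 1));
        for (int k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        char buf[16];
        if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        }
        out += buf;
        i += static_cast<std::size_t>(len);
    }
    return out;
}
```

Feeding this model the source text R"(Ä)" (with Ä as the UTF-8 bytes C3 84)
and the source text R"(\u00C4)" yields the same character sequence,
R"(\u00C4)", in both cases. Nothing in that output says which spelling the
programmer wrote, so a later phase cannot revert one and leave the other
alone without extra, hidden information.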

>>> I don't think we should entertain any notion of "same character" in C++,
>>> beyond value comparisons in the execution encoding and "identity" as
>>> needed for "same identifier".
>>>
>>>
>>> We need to in/before phase 1, but I think we reached the consensus that we otherwise
>>> shouldn't and wouldn't
>> I'm not sure we need to in phase 1 either. The only cases would be for conversion from source file characters that have multiple representations for the same semantic character, or (arguably) for Unicode normalization (which I believe we have consensus should not be performed in translation phase 1; in other words, EGCs are not "characters" for the purposes of translation phase 1).
>>
>>
>> In phase 1 we need _something_
>> Abstract character (which is exactly what the standard calls "Physical Character") lets us talk about the picture of the code case.
> In phase 1, we need the concept of identity in order to map the source input to the basic source character set + UCNs. I think Jens was arguing more that we do not need (and should not need) the concept of equivalence.
>
>
> Sure (as long as we accept that 1 abstract character may map to a sequence of code points (or UCNs))

No, each code point in a sequence (given Unicode input) is a separate abstract character
in my view (after combining surrogate pairs, of course).
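
As an illustration of that view, the sketch below splits UTF-16 input into
code points, combining surrogate pairs but performing no normalization and
no grapheme clustering. The function name code_points is hypothetical, and
unpaired surrogates are simply passed through, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits UTF-16 code units into code points. A high surrogate followed
// by a low surrogate is combined into one code point; everything else
// (including an unpaired surrogate) is passed through as-is.
std::vector<char32_t> code_points(const std::u16string& in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(u);
    }
    return out;
}
```

Under this model, u"\u00C4" is one element, while u"A\u0308" (A followed
by COMBINING DIAERESIS) is two elements, even though both render as Ä. A
surrogate pair such as u"\U0001F600" is a single element.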

>>> For example, if some hypothetical input format differentiates red and
>>> green letters that are otherwise "the same", I'd still expect a red A
>>> to be a different abstract character than a green A. (Ok, that doesn't
>>> work for the basic source character set, but should work for anything
>>> beyond that.)
>>>
>>>
>>> It doesn't work, as there isn't any culture on earth that makes that distinction, such that there exists no universal-character-name to make that distinction.
>>> It is best left to people of letters to decide whether colors carry meaning (and they sometimes do: https://en.wikipedia.org/wiki/Ersu_Shaba_script)
>> I believe Jens was just illustrating a hypothetical argument for the purpose of advancing the point that differently encoded source input should be preserved.
>>
>>
>> Yes, and I was explaining why that was not necessary
> I think your response focused too much on color; think of it as a charmed A and a strange A if that helps. The point was that conversion to source character set should not be lossy and we know of cases where it is lossy today.
>
>
> Again I would like to see an example of that

I thought the EBCDIC control characters are an example of a lossy conversion.
Unicode can define that problem as "out of scope" for them, but that
doesn't mean it goes away from a C++ perspective.

>>> If that means the term "character" or "abstract character" is too loaded
>>> to be used here, so be it. (The terminology space is already fairly
>>> crowded due to Unicode, so it's hard to find unused phrases that give the
>>> right connotation.)
>>>
>>>
>>> The terminology used by Unicode people isn't Unicode specific. In particular, "abstract character" is meaningful independently of
>>> any computer system.
>> I tend to agree, but some terms such as "code point" are defined in ISO/IEC 10646 as Unicode specific. We'll need to be careful about use of such terms that are reachable from our normative references.
>>
>>
>> There is a ton of precedent for using code point and code unit for arbitrary encodings - even if the terms originate from Unicode
>
> I agree, and I would like to use those terms. I'm not sure if we can use "code point" though because of its definition in ISO/IEC 10646 unless we provide an alternate definition (which I don't know if we can from an ISO perspective).
>
> I think we agreed that we should define the terms regardless of their existence in ISO/IEC 10646? Are there some ISO constraints on that?

We're normatively referring to ISO 10646, so I think it would be actively bad
if we were to redefine Unicode terms to mean something else in the general
context of text and characters. We should seek input from the project editor.

Jens

Received on 2020-06-14 14:58:59