sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 00:26:06 +0200

On Sun, 14 Jun 2020 at 23:54, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/14/20 4:19 PM, Corentin Jabot wrote:
>
>
>
> On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
>> > I agree, but per other messages in this and other threads, experts
>> > haven't fully defined mappings between character sets that fully preserve
>> > semantics and we seem to be aware of implementations that are impacted.
>> >
>> > Either they have, they will or they won't, it hardly should fall under
>> > the purview of the C++ committee :)
>>
>> If implementations of C++ are impacted by the choice of C++ to
>> weave more of Unicode into its specification, I think that's very
>> much under the purview of the C++ committee.
>>
>> > The raw literal magic reversion suggests to me that, post phase 1,
>> > something more is needed than just basic source characters + UCNs or just
>> > code points.
>> >
>> > I would like someone to give me 1 example of that :)
>> > Also the raw literal magic reversion has nothing to do with any of it?
>>
>> Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.
>>
>> If I write Ä in the original source, I expect to get exactly
>> that character in a raw string literal.
>>
>> If I write the (otherwise equivalent) \u00C4 in the original
>> source, I expect to get the six (ASCII) characters \u00C4
>> in a raw string literal.
>>
>
> I don't think anyone suggested that should not happen,
> and we all seem to agree that this reversal is a hack, but it works.
>
> I'm not sure that it actually works. If the source input is an image,
> what does it mean to revert the phase 1 translation? To copy the bits of
> the image corresponding to the character into the raw string literal? The
> question gets more ridiculous if non-digital sources are considered.
>

Actually, the more I think about it, the less I understand what the wording
is trying to preserve or not in raw literals :)
I am not sure the intent is properly described.
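
For illustration, a minimal sketch of the observable behaviour Jens describes above, as existing implementations are expected to provide it. The helper name `length` is hypothetical and only there to count code units; the literals are spelled with ASCII only, so the checks should not depend on how the source file is encoded.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t length(const CharT (&)[N]) { return N - 1; }

// Ordinary UTF-8 literal: the UCN denotes U+00C4, which is encoded
// as two UTF-8 code units (0xC3 0x84).
static_assert(length(u8"\u00C4") == 2,
              "UCN is translated in an ordinary literal");

// Raw literals: the same spelling stays as the six characters
// backslash, 'u', '0', '0', 'C', '4'.
static_assert(length(u8R"(\u00C4)") == 6,
              "UCN spelling is kept verbatim in a raw literal");
static_assert(length(R"(\u00C4)") == 6,
              "same for a raw literal without an encoding prefix");
static_assert(R"(\u00C4)"[0] == '\\' && R"(\u00C4)"[1] == 'u',
              "the raw literal starts with a literal backslash and 'u'");
```

The case of typing Ä directly inside the raw literal is deliberately left out: its result depends on how the implementation decodes the source file in phase 1, which is exactly the part the wording is unclear about.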

>
>>
>> However, as written, the specification says that Ä is turned
>> into \u00C4 in phase 1. Unless hidden information is attached
>> to \u00C4, the compiler doesn't know whether \u00C4 should be
>> reversed to Ä in a raw string literal, or not.
>>
>> This is the "magic reversal" we're talking about: The issue is
>> that the specification is silent about the hidden information.
>> However, the fact that the hidden information must exist is a
>> sign that either "just [Unicode] code points" or "just basic
>> source character set plus UCNs" does not convey enough
>> information.
>
>
> Yes, I know, and I am not suggesting that the behavior of any
> implementation should change in this regard.
> It might be beneficial to convert UCN escape sequences that appear
> verbatim in source files later in the translation process,
> but I have not yet explored that idea enough to figure out whether it
> would be sensible.
>
> I've been having similar thoughts.
>
>
> I don't think we would be introducing a new issue by changing the wording
> or the design. These escape sequences
> have to be tracked in the wording regardless, but maybe we are saying that
> we want to improve or find a better solution to the magic reversal thing?
> I will think about it :)
>
> I would like to find a better solution. For the moment, though, I'm mostly
> using it as a mechanism to help develop my mental model of how this all
> needs to work.
>
> Tom.
>

Received on 2020-06-14 17:29:28