sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 14 Jun 2020 17:54:14 -0400

On 6/14/20 4:19 PM, Corentin Jabot wrote:
>
>
> On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]
> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>
> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> > I agree, but per other messages in this and other threads,
> experts haven't fully defined mappings between character sets that
> fully preserve semantics and we seem to be aware of
> implementations that are impacted.
> >
> > Either they have, they will or they won't, it hardly should fall
> under the purview of the C++ committee :)
>
> If implementations of C++ are impacted by the choice of C++ to
> weave more of Unicode into its specification, I think that's very
> much under the purview of the C++ committee.
>
> > The raw literal magic reversion suggests to me that, post
> phase 1, something more is needed than just basic source
> characters + UCNs or just code points.
> >
> > I would like someone to give me 1 example of that :)
> > Also the raw literal magic reversion has nothing to do with any
> of it?
>
> Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.
>
> If I write Ä in the original source, I expect to get exactly
> that character in a raw string literal.
>
> If I write the (otherwise equivalent) \u00C4 in the original
> source, I expect to get the six (ASCII) characters \u00C4
> in a raw string literal.
>
>
> I don't think anyone suggested that should not happen,
> and we all seem to agree that this reversal is a hack, but it works

I'm not sure that it actually works.  If the source input is an image,
what does it mean to revert the phase 1 translation?  To copy the bits
of the image corresponding to the character into the raw string
literal?  The question gets more ridiculous if non-digital sources are
considered.
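
For the ordinary text-file case, the behavior Jens describes above can
be observed directly. The following is only a small sketch, assuming a
UTF-8 encoded source file and a C++11 (or later) compiler that accepts
it; the function names are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Raw string literal whose source spelling is the six basic source
// characters \u00C4. Inside a raw string literal the UCN is not
// interpreted, so the literal holds those six characters.
inline std::string raw_ucn() { return R"(\u00C4)"; }

// Raw string literal whose source spelling is the character itself.
// Assumption: the source file is UTF-8 and the compiler reads it as such.
// Phase 1 nominally turns this into a UCN, and the raw literal then
// has to get the original character back.
inline std::string raw_direct() { return R"(Ä)"; }

// Ordinary string literal for comparison: here the UCN is interpreted
// and denotes U+00C4 in the execution character set.
inline std::string cooked_ucn() { return "\u00C4"; }
```

The two raw literals differ even though, by the wording, both contain
\u00C4 after phase 1. How many code units raw_direct() and cooked_ucn()
hold depends on the execution encoding, so nothing is claimed about
that here.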
>
>
> However, as written, the specification says that Ä is turned
> into \u00C4 in phase 1. Unless hidden information is attached
> to \u00C4, the compiler doesn't know whether \u00C4 should be
> reversed to Ä in a raw string literal, or not.
>
> This is the "magic reversal" we're talking about: The issue is
> that the specification is silent about the hidden information.
> However, the fact that the hidden information must exist is a
> sign that either "just [Unicode] code points" or "just basic
> source character set plus UCNs" does not convey enough
> information.
>
> Yes, I know, and I am not suggesting that the behavior of any
> implementation should change in this regard.
> It might be beneficial to convert UCN escape sequences that appear
> verbatim in source files later in the translation process
> but I have not yet explored that idea enough to figure out whether it
> would be sensible.

I've been having similar thoughts.
>
> I don't think we would be introducing a new issue by changing the
> wording or the design. These escape sequences
> have to be tracked in the wording regardless, but maybe we are saying
> that we want to improve or find a better solution to the magic
> reversal thing?
> I will think about it :)

I would like to find a better solution.  For the moment, though, I'm
using it more as a mechanism to help develop my mental model of how this
all needs to work.
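
To make the "hidden information" concrete, here is one possible toy
model, not a proposal and not how any implementation is known to work.
It assumes the input has already been decoded to code points, and it
recognizes only the four-digit \uXXXX form. The names Phase1Element,
phase1, and revert are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One element of a hypothetical post-phase-1 sequence.
struct Phase1Element {
    char32_t code_point;      // the character denoted
    std::u32string spelling;  // hidden information: how the source spelled it
};

// Value of a hexadecimal digit, or -1 if c is not one.
inline int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Toy phase 1: a \uXXXX spelling and a directly written character end up
// with the same code_point, but each element keeps its original spelling.
inline std::vector<Phase1Element> phase1(const std::u32string& src) {
    std::vector<Phase1Element> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == U'\\' && i + 6 <= src.size() && src[i + 1] == U'u') {
            char32_t value = 0;
            bool ok = true;
            for (std::size_t k = 2; k < 6; ++k) {
                const int h = hex_value(src[i + k]);
                if (h < 0) { ok = false; break; }
                value = value * 16 + static_cast<char32_t>(h);
            }
            if (ok) {
                out.push_back({value, src.substr(i, 6)});
                i += 6;
                continue;
            }
        }
        out.push_back({src[i], src.substr(i, 1)});
        ++i;
    }
    return out;
}

// What a raw string literal needs: the original spelling of each element.
inline std::u32string revert(const std::vector<Phase1Element>& elems) {
    std::u32string out;
    for (const Phase1Element& e : elems) out += e.spelling;
    return out;
}
```

In this model, storing only a "was a UCN" flag would not be enough,
since \u00c4 and \u00C4 denote the same code point but are spelled
differently. That is part of why I suspect the reversion needs the
original spelling, or needs the conversion to happen later.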

Tom.

Received on 2020-06-14 16:57:25