
SG16


Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-14 16:54:14


On 6/14/20 4:19 PM, Corentin Jabot wrote:
>
>
> On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]
> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>
> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> >     I agree, but per other messages in this and other threads,
> >     experts haven't fully defined mappings between character sets
> >     that fully preserve semantics, and we seem to be aware of
> >     implementations that are impacted.
> >
> > Either they have, they will or they won't, it hardly should fall
> > under the purview of the C++ committee :)
>
> If implementations of C++ are impacted by the choice of C++ to
> weave more of Unicode into its specification, I think that's very
> much under the purview of the C++ committee.
>
> >     The raw literal magic reversion suggests to me that, post
> >     phase 1, something more is needed than just basic source
> >     characters + UCNs or just code points.
> >
> > I would like someone to give me 1 example of that :)
> > Also the raw literal magic reversion has nothing to do with any
> > of it?
>
> Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.
>
> If I write Ä in the original source, I expect to get exactly
> that character in a raw string literal.
>
> If I write the (otherwise equivalent) \u00C4 in the original
> source, I expect to get the six (ASCII) characters \u00C4
> in a raw string literal.
>
>
> I don't think anyone suggested that should not happen,
> and we all seem to agree that this reversal is a hack, but it works

I'm not sure that it actually works.  If the source input is an image,
what does it mean to revert the phase 1 translation?  To copy the bits
of the image corresponding to the character into the raw string
literal?  The question gets more ridiculous if non-digital sources are
considered.
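
For reference, the observable behavior Jens describes can be sketched as below. This is only an illustration, assuming a C++11 or later compiler; `literal_length`, `ucn_in_ordinary`, and `ucn_in_raw` are names made up for this example. It spells the character only as a UCN, so the result does not depend on the encoding of the source file, and it uses the `u8` prefix so the ordinary literal is UTF-8 regardless of the execution character set.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// (Hypothetical helper, introduced only for this illustration.)
template <typename CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}

// Ordinary literal: the UCN denotes U+00C4, which UTF-8 encodes as the
// two bytes 0xC3 0x84.
constexpr std::size_t ucn_in_ordinary = literal_length(u8"\u00C4");

// Raw literal: the UCN is not interpreted, so the literal holds the six
// characters '\', 'u', '0', '0', 'C', '4'.
constexpr std::size_t ucn_in_raw = literal_length(u8R"(\u00C4)");

static_assert(ucn_in_ordinary == 2, "UCN is replaced in an ordinary literal");
static_assert(ucn_in_raw == 6, "UCN spelling is preserved in a raw literal");
```

The two spellings name the same character outside a raw literal but are distinguishable inside one, which is why something has to remember how the character was originally written.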
>
>
> However, as written, the specification says that Ä is turned
> into \u00C4 in phase 1.  Unless hidden information is attached
> to \u00C4, the compiler doesn't know whether \u00C4 should be
> reversed to Ä in a raw string literal, or not.
>
> This is the "magic reversal" we're talking about: The issue is
> that the specification is silent about the hidden information.
> However, the fact that the hidden information must exist is a
> sign that either "just [Unicode] code points" or "just basic
> source character set plus UCNs" does not convey enough
> information.
>
> Yes, I know, and I am not suggesting that the behavior of any
> implementation should change in this regard.
> It might be beneficial to convert UCN escape sequences that appear
> verbatim in source files later in the translation process,
> but I have not yet explored that idea enough to figure out whether
> it would be sensible.

I've been having similar thoughts.
>
> I don't think we would be introducing a new issue by changing the
> wording or the design. These escape sequences
> have to be tracked in the wording regardless, but maybe we are saying
> that we want to improve or find a better solution to the magic
> reversal thing?
> I will think about it :)

I would like to find a better solution.  For the moment though, I'm mostly
using it as a mechanism to help develop my mental model of how this all
needs to work.
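
As one way to picture the "hidden information", here is a rough sketch of a toy phase 1 that tags each element with how it was spelled. It is a thought-experiment model, not how any implementation or the standard wording actually does it. `SourceElement`, `phase1`, and `revert_for_raw_literal` are hypothetical names, the input is assumed to be already decoded to code points, and only the four-digit `\uXXXX` form is recognized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical post-phase-1 element: a code point plus the "hidden
// information" recording whether the source spelled it as a UCN.
struct SourceElement {
    char32_t code_point;
    bool spelled_as_ucn;
};

// Value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Toy phase 1: turn \uXXXX sequences into tagged code points and pass
// every other code point through untagged.
std::vector<SourceElement> phase1(const std::u32string& src) {
    std::vector<SourceElement> out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == U'\\' && i + 5 < src.size() && src[i + 1] == U'u') {
            char32_t value = 0;
            bool ok = true;
            for (std::size_t k = 2; k <= 5; ++k) {
                int h = hex_value(src[i + k]);
                if (h < 0) { ok = false; break; }
                value = value * 16 + static_cast<char32_t>(h);
            }
            if (ok) {
                out.push_back({value, true});
                i += 5;  // skip the rest of the UCN
                continue;
            }
        }
        out.push_back({src[i], false});
    }
    return out;
}

// Toy reversal for raw string literals: re-spell tagged elements as UCNs.
// Note the tag alone is lossy: \u00c4 would come back as \u00C4.
std::u32string revert_for_raw_literal(const std::vector<SourceElement>& elems) {
    static const char32_t digits[] = U"0123456789ABCDEF";
    std::u32string out;
    for (const SourceElement& e : elems) {
        if (e.spelled_as_ucn) {
            out += U'\\';
            out += U'u';
            for (int shift = 12; shift >= 0; shift -= 4)
                out += digits[(e.code_point >> shift) & 0xF];
        } else {
            out += e.code_point;
        }
    }
    return out;
}
```

Without the `spelled_as_ucn` flag the two elements for Ä and \u00C4 would be identical and the reversal could not be done. Even with it, the original digit case is lost, which suggests that a faithful model would have to retain the original spelling rather than a single bit.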

Tom.


