Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-14 16:54:14
On 6/14/20 4:19 PM, Corentin Jabot wrote:
> On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> > I agree, but per other messages in this and other threads,
> experts haven't fully defined mappings between character sets that
> fully preserve semantics and we seem to be aware of
> implementations that are impacted.
> > Either they have, they will or they won't, it hardly should fall
> under the purview of the C++ committee :)
> If implementations of C++ are impacted by the choice of C++ to
> weave more of Unicode into its specification, I think that's very
> much under the purview of the C++ committee.
> > The raw literal magic reversion suggests to me that, post
> phase 1, something more is needed than just basic source
> characters + UCNs or just code points.
> > I would like someone to give me 1 example of that :)
> > Also the raw literal magic reversion has nothing to do with any
> of it?
> Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.
> If I write Ä in the original source, I expect to get exactly
> that character in a raw string literal.
> If I write the (otherwise equivalent) \u00C4 in the original
> source, I expect to get the six (ASCII) characters \u00C4
> in a raw string literal.
> I don't think anyone suggested that should not happen,
> and we all seem to agree that this reversal is a hack, but it works
I'm not sure that it actually works. If the source input is an image,
what does it mean to revert the phase 1 translation? To copy the bits
of the image corresponding to the character into the raw string
literal? The question gets more ridiculous if non-digital sources are
> However, as written, the specification says that Ä is turned
> into \u00C4 in phase 1. Unless hidden information is attached
> to \u00C4, the compiler doesn't know whether \u00C4 should be
> reversed to Ä in a raw string literal, or not.
> This is the "magic reversal" we're talking about: The issue is
> that the specification is silent about the hidden information.
> However, the fact that the hidden information must exist is a
> sign that either "just [Unicode] code points" or "just basic
> source character set plus UCNs" does not convey enough
> Yes, I know, and I am not suggesting that the behavior of any
> implementation should change in this regard.
> It might be beneficial to convert UCN escape sequences that appear
> verbatim in source files later in the translation process,
> but I have not yet explored that idea enough to figure out whether it
> would be sensible.
I've been having similar thoughts.
> I don't think we would be introducing a new issue by changing the
> wording or the design. These escape sequences
> have to be tracked in the wording regardless, but maybe we are saying
> that we want to improve or find a better solution to the magic
> reversal thing?
> I will think about it :)
I would like to find a better solution. For the moment though, I'm more
using it as a mechanism to help develop my mental model of how this all
needs to work.
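
For reference, here is a minimal sketch of the observable behavior Jens
describes for a UCN spelled out in the source. It assumes a C++17
compiler; the names raw_ucn and cooked_ucn are purely illustrative and
appear nowhere in the wording. It only covers the \u00C4 spelling. The
directly typed Ä case is left out because what the compiler sees there
depends on how the source file is encoded and decoded in phase 1.

```cpp
#include <string_view>

// Raw string literal: the UCN spelling is kept verbatim, so the
// literal holds the six basic source characters \ u 0 0 C 4.
constexpr std::string_view raw_ucn = R"(\u00C4)";

// Ordinary UTF-16 literal: the same spelling is interpreted as a UCN
// and denotes the single code point U+00C4.
constexpr std::u16string_view cooked_ucn = u"\u00C4";

// Compile-time checks of the two expectations above.
static_assert(raw_ucn.size() == 6);
static_assert(raw_ucn[0] == '\\' && raw_ucn[1] == 'u');
static_assert(cooked_ucn.size() == 1);
static_assert(cooked_ucn[0] == 0x00C4);
```

If the translation unit compiles, both expectations hold for that
implementation. The sketch says nothing about how the implementation
tracks which spelling was used in the original source; that is exactly
the hidden information under discussion.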