On Sun, 14 Jun 2020 at 23:54, Tom Honermann <tom@honermann.net> wrote:

On 6/14/20 4:19 PM, Corentin Jabot wrote:

On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 14/06/2020 21.33, Corentin Jabot via SG16 wrote:
> I agree, but per other messages in this and other threads, experts haven't fully defined mappings between character sets that fully preserve semantics, and we seem to be aware of implementations that are impacted.
>
> Either they have, they will, or they won't; it should hardly fall under the purview of the C++ committee :)

If implementations of C++ are impacted by the choice of C++ to
weave more of Unicode into its specification, I think that's very
much under the purview of the C++ committee.

> The raw literal magic reversion suggests to me that, post phase 1, something more is needed than just basic source characters + UCNs or just code points.
>
> I would like someone to give me 1 example of that :)
> Also the raw literal magic reversion has nothing to do with any of it?

Consider LATIN CAPITAL LETTER A WITH DIAERESIS in a UTF-8 world.

If I write Ä in the original source, I expect to get exactly
that character in a raw string literal.

If I write the (otherwise equivalent) \u00C4 in the original
source, I expect to get the six (ASCII) characters \u00C4
in a raw string literal.
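
A minimal sketch of the example above, assuming a compiler that treats both the source file and the ordinary literal encoding as UTF-8. The function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>

// Raw literal: the backslash is not an escape here, so this holds the six
// ASCII characters (backslash, u, 0, 0, C, 4) exactly as typed.
inline std::string raw_ucn() { return R"(\u00C4)"; }

// Ordinary literal: the same spelling is a universal-character-name and is
// encoded in the literal encoding (two bytes, C3 84, if that is UTF-8).
inline std::string cooked_ucn() { return "\u00C4"; }

// Raw literal with the character typed directly. Assumption: this file is
// saved and compiled as UTF-8.
inline std::string raw_direct() { return R"(Ä)"; }

// Checks the expectations stated in the message above.
inline void check_raw_literal_examples() {
    assert(raw_ucn().size() == 6);         // six ASCII characters
    assert(raw_ucn() == "\\u00C4");        // spelled out with an escaped backslash
    assert(raw_direct() == cooked_ucn());  // same character, different spelling
}
```

The last assertion is the interesting one: outside a raw literal the two spellings are equivalent, while inside a raw literal they must stay distinct.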

I don't think anyone suggested that should not happen, and we all seem to agree that this reversal is a hack, but it works.

I'm not sure that it actually works. If the source input is an image, what does it mean to revert the phase 1 translation? To copy the bits of the image corresponding to the character into the raw string literal? The question gets more ridiculous if non-digital sources are considered.

Actually, the more I think about it, the less I understand what the wording is trying to preserve or not in raw literals :)

I am not sure the intent is properly described.

However, as written, the specification says that Ä is turned
into \u00C4 in phase 1. Unless hidden information is attached
to \u00C4, the compiler doesn't know whether \u00C4 should be
reversed to Ä in a raw string literal, or not.

This is the "magic reversal" we're talking about: The issue is
that the specification is silent about the hidden information.
However, the fact that the hidden information must exist is a
sign that either "just [Unicode] code points" or "just basic
source character set plus UCNs" does not convey enough
information.
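
One way to picture the hidden information is to let each post-phase-1 element carry its original spelling next to its code point. The following is a toy model under that assumption, not a description of how any particular compiler works. `SourceChar`, `toy_phase1`, and `revert_for_raw_literal` are made-up names, the input is assumed to be already decoded to code points, and only four-digit `\uXXXX` sequences are recognized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical post-phase-1 element: the code point plus the "hidden
// information", here the exact spelling used in the source.
struct SourceChar {
    char32_t code_point;
    std::u32string spelling;
};

// Returns the value of a hex digit, or -1 if c is not one.
inline int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Toy phase 1: a backslash, 'u', and four hex digits become one element;
// every other code point becomes one element spelled as itself.
inline std::vector<SourceChar> toy_phase1(const std::u32string& src) {
    std::vector<SourceChar> out;
    std::size_t i = 0;
    while (i < src.size()) {
        bool is_ucn = i + 6 <= src.size() && src[i] == U'\\' && src[i + 1] == U'u';
        char32_t value = 0;
        for (std::size_t k = 2; is_ucn && k < 6; ++k) {
            int h = hex_value(src[i + k]);
            if (h < 0) is_ucn = false;
            else value = value * 16 + static_cast<char32_t>(h);
        }
        if (is_ucn) {
            out.push_back({value, src.substr(i, 6)});
            i += 6;
        } else {
            out.push_back({src[i], std::u32string(1, src[i])});
            i += 1;
        }
    }
    return out;
}

// The "magic reversal": inside a raw literal, emit the original spelling
// instead of the code point.
inline std::u32string revert_for_raw_literal(const std::vector<SourceChar>& chars) {
    std::u32string out;
    for (const SourceChar& c : chars) out += c.spelling;
    return out;
}

// Both spellings yield the same code point, yet each reverts to itself.
inline void check_reversal_examples() {
    const std::u32string direct(1, char32_t(0xC4));  // the character itself
    const std::u32string ucn = U"\\u00C4";           // six ASCII characters
    assert(toy_phase1(direct)[0].code_point == toy_phase1(ucn)[0].code_point);
    assert(revert_for_raw_literal(toy_phase1(direct)) == direct);
    assert(revert_for_raw_literal(toy_phase1(ucn)) == ucn);
}
```

If the `spelling` member is dropped, the two inputs produce identical elements and the reversal has nothing to go on, which is the gap in the wording described above.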

Yes, I know, and I am not suggesting that the behavior of any implementation should change in this regard.

It might be beneficial to defer the conversion of UCN escape sequences that appear verbatim in source files until later in the translation process, but I have not yet explored that idea enough to figure out whether it would be sensible.

I've been having similar thoughts.

I don't think we would be introducing a new issue by changing the wording or the design. These escape sequences have to be tracked in the wording regardless, but maybe we are saying that we want to improve on, or find a better solution to, the magic reversal thing?

I will think about it :)

I would like to find a better solution. For the moment though, I'm more using it as a mechanism to help develop my mental model of how this all needs to work.

Tom.