I think I get it - "text" is represented in the execution character set,
whereas u8"text" is represented directly as UTF-8.  In both cases
the abstract machine's internal representation will be UTF-8; we
are simply moving where the standard says some of the transcoding
occurs.

Also, as you cite ISO 10646:1993 for provenance, it might be worth
pointing out how regularly that standard has been updated, and that our
current reference is to a document withdrawn by ISO in 2003, two
full decades before our planned C++23 publication!

The latest standard is from 2017, with the current FDIS for 2020 out for review,
pending CD ballot: https://www.iso.org/standard/76835.html


AlisdairM

On Aug 24, 2020, at 12:32, Peter Brett <pbrett@cadence.com> wrote:

Hi Alisdair,

Thank you for the feedback; that's a very good suggestion.  It ties into the suggested change to processing of UCNs that we've discussed a few times.

When you have a u8"" literal, the associated literal encoding is UTF-8.  When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
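A minimal sketch of that distinction, assuming a C++11 or later compiler. The names utf8_size, plain_size and utf8_unit are hypothetical and exist only for this illustration. U+00E9 is written as a universal-character-name so the result does not depend on how the source file itself is encoded.

```cpp
#include <cstddef>

// u8 literal: the associated literal encoding is UTF-8, so U+00E9 is always
// the two code units 0xC3 0xA9, followed by the terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\u00E9");

// Plain literal: the associated literal encoding is implementation-defined,
// so the size may or may not match the UTF-8 case. With a UTF-8 literal
// encoding it is 3; with a single-byte encoding such as Latin-1 it is 2.
constexpr std::size_t plain_size = sizeof("\u00E9");

// Hypothetical helper: read one code unit of the u8 literal as a byte value.
// The cast works whether the element type is char (C++11 to C++17) or
// char8_t (C++20).
constexpr unsigned char utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}

static_assert(utf8_size == 3, "two UTF-8 code units plus the null");
```

Only the u8 result can be asserted portably; plain_size is deliberately left unchecked because it depends on the implementation's choice of encoding.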

Best regards,

                     Peter

-----Original Message-----
From: Alisdair Meredith <alisdairm@me.com>
Sent: 24 August 2020 17:29
To: SG16 <sg16@lists.isocpp.org>
Cc: Peter Brett <pbrett@cadence.com>; Corentin <corentin.jabot@gmail.com>
Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

Minor suggestion on the wording,

You strike the mapping of non-basic source code characters to
universal-character-name, including the cross-reference to such
mappings reverting in raw string literals (5.4).  I suggest making
a matching edit to strike the reference in (5.4)p3 as well, so that
the only thing reverted is line splicing in phase 2.
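To make the remaining reversion concrete, here is a small sketch of line splicing inside and outside a raw string literal, assuming a C++11 or later compiler. The names spliced, raw, spliced_len and raw_len are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Ordinary literal: phase 2 deletes the backslash and the newline after it,
// so the two physical lines join into the four characters "abcd".
constexpr char spliced[] = "ab\
cd";

// Raw literal: the splice is reverted between the delimiters, so both the
// backslash and the newline remain part of the string.
constexpr char raw[] = R"(ab\
cd)";

// Lengths without the terminating null.
constexpr std::size_t spliced_len = sizeof(spliced) - 1;
constexpr std::size_t raw_len = sizeof(raw) - 1;

static_assert(spliced_len == 4, "backslash-newline removed by splicing");
static_assert(raw_len == 6, "backslash and newline kept in the raw literal");
```

The continuation lines start in the first column on purpose: any leading whitespace there would become part of both literals.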

That said, with these changes, I am curious what the difference
is between a u8 string literal and a plain 'char' string literal, as
the contents of that literal are now going to be Unicode source
text (rather than requesting a mapping from source to Unicode
of the literal's contents)?

AlisdairM

On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16@lists.isocpp.org>
wrote:

Hi all,

In this week's meeting, we are going to discuss the remaining
proposals from P2178R1 "Misc lexing and string handling improvements".
In particular, we will discuss proposal 9:

  Proposal 9: Reaffirming Unicode as the character set of the
  internal representation

In anticipation of a lively discussion, Corentin and I have written a
short new paper which will be appearing in the September mailing.

  P2194R0 The character set of C++ source code is Unicode

https://isocpp.org/files/papers/P2194R0.pdf

We hope that the study group finds this contribution helpful and
informative.

Best regards,

                     Peter

--
SG16 mailing list
SG16@lists.isocpp.org

https://lists.isocpp.org/mailman/listinfo.cgi/sg16