Looking into the wording more deeply, you are going to have to address
the notion of characters in identifiers: if you are no longer mapping into
UCNs, we can no longer rely on non-digit handling that mapping for

This is probably best addressed in terms of
rather than trying for a different fix that will roll into that later.

Note that I was looking into a much more limited requirement to merely
require support for UTF-8 encoded source files as a well-specified
mapping, rather than switching to Unicode as the replacement for the
basic source character set.  Once P1945 is adopted, I will have an
easier time buying into this paper, but in isolation it raises too many
tricky questions for me about Unicode text in surprising places.  I
suspect we will need a deeper review of the grammar around identifiers
and other tokens to flush all the issues out - which is largely what P1945


On Aug 24, 2020, at 12:32, Peter Brett <pbrett@cadence.com> wrote:

Hi Alisdair,

Thank you for the feedback.  That's a very good suggestion.  It ties into the suggested change to processing of UCNs that we've discussed a few times.

When you have a u8"" literal, the associated literal encoding is UTF-8.  When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
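
As a rough illustration of that distinction (a sketch only, not wording from the paper): the names utf8_size, utf8_byte and ordinary_size below are made up for this example, and U+00E9 is written as a UCN so the result does not depend on how the file itself is encoded.  It assumes a C++17 or later compiler whose ordinary literal encoding can represent U+00E9.

```cpp
#include <cstddef>

// A u8 literal is always encoded as UTF-8: U+00E9 becomes the two code
// units 0xC3 0xA9, followed by the terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\u00E9");

// Hypothetical helper: read one code unit of the u8 literal as a byte value.
// The element type is char in C++17 and char8_t in C++20; both convert.
constexpr unsigned char utf8_byte(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}

// An ordinary literal uses the implementation-defined literal encoding, so
// its size can differ between implementations (for example 3 with a UTF-8
// encoding, 2 with a single-byte encoding that contains U+00E9).
constexpr std::size_t ordinary_size = sizeof("\u00E9");

static_assert(utf8_size == 3, "two UTF-8 code units plus the null");
static_assert(utf8_byte(0) == 0xC3, "first UTF-8 code unit of U+00E9");
static_assert(utf8_byte(1) == 0xA9, "second UTF-8 code unit of U+00E9");
```

Nothing is asserted about ordinary_size on purpose, since that value is exactly the part left to the implementation.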

Best regards,


-----Original Message-----
From: Alisdair Meredith <alisdairm@me.com>
Sent: 24 August 2020 17:29
To: SG16 <sg16@lists.isocpp.org>
Cc: Peter Brett <pbrett@cadence.com>; Corentin <corentin.jabot@gmail.com>
Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode


Minor suggestion on the wording,

You strike the mapping of non-basic source code characters to
universal-character-name, including the cross-reference to such
mappings reverting in raw string literals (5.4).  I suggest making
a matching edit to strike the reference in (5.4)p3 as well, so that
the only thing reverted is line splicing in phase 2.
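
For anyone less familiar with the reverting behaviour, here is a small sketch of the line splicing half of it, assuming a C++11 or later compiler.  The variable names spliced and raw are invented for the example; it shows current behaviour rather than anything new in the paper.

```cpp
#include <cstddef>

// Ordinary literal: the backslash immediately followed by a newline is
// spliced away in phase 2, so the value is "abcd".
constexpr char spliced[] = "ab\
cd";

// Raw literal: the splice is reverted between the delimiters, so the
// backslash and the newline both remain part of the value.
constexpr char raw[] = R"(ab\
cd)";

static_assert(sizeof(spliced) == 5, "a b c d plus the null");
static_assert(sizeof(raw) == 7, "a b backslash newline c d plus the null");
static_assert(raw[2] == '\\' && raw[3] == '\n', "splice was reverted");
```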

That said, with these changes, I am curious what the difference
is between a u8 string literal and a plain ‘char’ string literal, as
the contents of that literal are now going to be Unicode source
text (rather than requesting a mapping from source to Unicode
of the literal’s contents)?


On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16@lists.isocpp.org> wrote:

Hi all,

In this week's meeting, we are going to discuss the remaining
proposals from P2178R1 "Misc lexing and string handling improvements".
In particular, we will discuss proposal 9:

  Proposal 9: Reaffirming Unicode as the character set of the
  internal representation

In anticipation of a lively discussion, Corentin and I have written a
short new paper which will be appearing in the September mailing.

  P2194R0 The character set of C++ source code is Unicode


We hope that the study group finds this contribution helpful and

Best regards,


SG16 mailing list