On Mon, Aug 24, 2020 at 3:44 PM Alisdair Meredith via SG16 <sg16@lists.isocpp.org> wrote:

Got another good corner case for you!

In the template form of user defined literals, the template parameter pack
is instantiated with characters corresponding to the source text, currently
mapping non-basic characters to UCNs, so that the template parser can
assume all characters are members of the basic source character set:

See [lex.ext] 5.13.8p3/4
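
For readers less familiar with the template form, here is a minimal sketch of a numeric literal operator template. The suffixes `_len` and `_b` are hypothetical names chosen for illustration; they do not come from the paper or the standard. The sketch only shows how the source characters of the literal arrive as a `char...` pack, which is the mechanism the paragraph above refers to.

```cpp
#include <cstddef>

// Hypothetical suffix: yields the number of source characters in the literal.
template <char... Cs>
constexpr std::size_t operator""_len() {
    return sizeof...(Cs);
}

// Hypothetical suffix: interprets the source characters as binary digits.
// Assumes every character in the pack is '0' or '1'.
template <char... Cs>
constexpr unsigned operator""_b() {
    const char digits[] = {Cs...};
    unsigned value = 0;
    for (char c : digits) {
        value = value * 2 + static_cast<unsigned>(c - '0');
    }
    return value;
}

// The pack holds the characters exactly as spelled in the source.
static_assert(123_len == 3, "three source characters: '1', '2', '3'");
static_assert(0x1F_len == 4, "four source characters: '0', 'x', '1', 'F'");
static_assert(101_b == 5, "binary 101 is 5");
```

A parser written this way inspects each `char` individually, so any change in which characters can appear in the pack is visible to it.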

By no longer mapping to UCNs, we break any UDL parsers that work with
UCNs today. I don’t know how many there are in production, possibly zero,
but it is a risk to address, and it warrants an entry in compatibility Annex C.

I am currently searching the standard for the phrase “source character” and
trying to make sense of the difference between “source character set” and
“basic source character set”. The former seems to refer to some mythical
thing that exists prior to conversion to UCNs, but applies to text being
processed /after/ UCNification, where it is not clear that it makes a real
distinction at that point.

Good examples are the h-char and q-char sequences for header names.
The current text just looks broken for header names outside the basic
source character set, as the text we actually parse is post-UCNification,
but it is also conditionally supported behavior to have a ‘\’ character in such
a char-sequence, indicating that post-UCNified text is problematic.

I believe this paper will be more than the light treatment you seem to expect,
but it will shake out and fix a few dusty corners giving us a more robust spec
as part of the process - and that would be another feature of the proposal
that I could get behind!

The previous discussions were already leading towards a direction that would address such corners through removing UCNification. I am not sure why the decision to restrict the processing to characters representable in Unicode is not considered a separable question.

AlisdairM

On Aug 24, 2020, at 12:32, Peter Brett <pbrett@cadence.com> wrote:

Hi Alisdair,

Thank you for the feedback; that's a very good suggestion. It ties into the suggested change to processing of UCNs that we've discussed a few times.

When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
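
A small sketch of that distinction, under the assumption that the implementation accepts U+00E9 in a plain literal at all. The character is written as a universal-character-name so the example does not depend on the encoding of the source file. The constant names are hypothetical.

```cpp
#include <cstddef>

// u8 literal: the encoding is UTF-8, so U+00E9 is always the two code
// units 0xC3 0xA9, followed by the terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\u00E9");

// Plain literal: the encoding is implementation-defined, so the size can
// differ between implementations (for example 2 with a Latin-1 encoding).
constexpr std::size_t plain_size = sizeof("\u00E9");

static_assert(utf8_size == 3, "two UTF-8 code units plus the null");
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "trail byte");
static_assert(plain_size >= 2, "at least one code unit plus the null");
```

Only the `u8` assertions are portable; the value of `plain_size` is deliberately not pinned down.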

Best regards,

                     Peter

-----Original Message-----
From: Alisdair Meredith <alisdairm@me.com>
Sent: 24 August 2020 17:29
To: SG16 <sg16@lists.isocpp.org>
Cc: Peter Brett <pbrett@cadence.com>; Corentin <corentin.jabot@gmail.com>
Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

Minor suggestion on the wording,

You strike the mapping of non-basic source code characters to
universal-character-name, including the cross-reference to such
mappings reverting in raw string literals (5.4). I suggest making
a matching edit to strike the reference in (5.4)p3 as well, so that
the only thing reverted is line splicing in phase 2.
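
To make the reverting behavior concrete, here is a minimal sketch of what a raw string literal preserves today. The constant names are hypothetical, and the sizes assume each line break in the source counts as a single new-line character.

```cpp
#include <cstddef>

// Ordinary literal: the backslash-newline is spliced away in phase 2,
// leaving "ab" plus the terminating null.
constexpr std::size_t spliced_size = sizeof("a\
b");

// Raw literal: the splice is reverted, so the backslash and the new-line
// are both part of the string: 'a', '\\', '\n', 'b', plus the null.
constexpr std::size_t raw_size = sizeof(R"(a\
b)");

// Raw literal: a universal-character-name is not interpreted; the six
// characters \ u 0 0 E 9 are kept as written, plus the null.
constexpr std::size_t raw_ucn_size = sizeof(R"(\u00E9)");

static_assert(spliced_size == 3, "splice removed");
static_assert(raw_size == 5, "splice reverted inside a raw string");
static_assert(raw_ucn_size == 7, "UCN spelling preserved inside a raw string");
```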

That said, with these changes, I am curious what the difference
is between a u8 string literal and a plain ‘char’ string literal, as
the contents of that literal are now going to be Unicode source
text (rather than requesting a mapping from source to Unicode
of the literal’s contents)?

AlisdairM

On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16@lists.isocpp.org>
wrote:

Hi all,

In this week's meeting, we are going to discuss the remaining
proposals from P2178R1 "Misc lexing and string handling improvements".
In particular, we will discuss proposal 9:

  Proposal 9: Reaffirming Unicode as the character set of the
  internal representation

In anticipation of a lively discussion, Corentin and I have written a
short new paper which will be appearing in the September mailing.

  P2194R0 The character set of C++ source code is Unicode

https://isocpp.org/files/papers/P2194R0.pdf

We hope that the study group finds this contribution helpful and
informative.

Best regards,

                     Peter

--
SG16 mailing list
SG16@lists.isocpp.org

https://lists.isocpp.org/mailman/listinfo.cgi/sg16
