sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Mon, 24 Aug 2020 17:00:30 -0400

W.r.t. "source character set", "basic source character set", and
"execution character set": they seem to be imports from the C
standard, which has a different model of how this all works. In
particular, as used in the C++ standard, they do not really have an
encoding; they are instead abstract symbols. The basic execution
character set is the basic source character set plus the characters
necessary for the defined character escape sequences, and the values
of those characters are controlled by "locale". The grammar really only
understands characters in the basic source character set, and universal
character names are made up of a small subset of the basic source
characters. In phase 1 all characters outside the basic source
character set (BSCS) are converted to UCNs, and characters inside the
BSCS are left alone.
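
As an illustration of that phase 1 model, here is a minimal C++ sketch of
the mapping. It is not taken from the standard or from any compiler. The
helper names in_basic_source_set and to_ucn_form are hypothetical, the
input is assumed to be already decoded to UTF-32 code points, the set is
assumed to be the 96-character C++20 basic source character set, and the
reversion that happens inside raw string literals is ignored.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: true if c is one of the 96 characters assumed here
// to make up the C++20 basic source character set.
bool in_basic_source_set(char32_t c) {
    if (c >= U'a' && c <= U'z') return true;
    if (c >= U'A' && c <= U'Z') return true;
    if (c >= U'0' && c <= U'9') return true;
    // Space, the four control characters, and the 29 punctuation characters.
    constexpr std::u32string_view others =
        U" \t\v\f\n_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return others.find(c) != std::u32string_view::npos;
}

// Hypothetical helper: leave basic source characters alone and spell every
// other code point as a universal-character-name (\uXXXX or \UXXXXXXXX).
std::string to_ucn_form(std::u32string_view text) {
    std::string out;
    for (char32_t c : text) {
        if (in_basic_source_set(c)) {
            out += static_cast<char>(c);  // all of these are ASCII
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating NUL
            if (c <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04X",
                              static_cast<unsigned>(c));
            } else {
                std::snprintf(buf, sizeof buf, "\\U%08X",
                              static_cast<unsigned>(c));
            }
            out += buf;
        }
    }
    return out;
}
```

With this sketch, to_ucn_form(U"caf\u00E9") yields the text caf\u00E9
(a backslash, 'u', and four hex digits), while plain text such as
"int x;" passes through unchanged.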

On Mon, Aug 24, 2020 at 3:44 PM Alisdair Meredith via SG16
<sg16_at_[hidden]> wrote:
>
> Got another good corner case for you!
>
> In the template form of user-defined literals, the template parameter pack
> is instantiated with characters corresponding to the source text, currently
> mapping non-basic characters to UCNs, so that the template parser can
> assume all characters are members of the basic source character set:
>
> See [lex.ext] 5.13.8p3/4
>
> By no longer mapping to UCNs, we break any UDL parsers that work with
> UCNs today. I don’t know how many there are in production, possibly zero,
> but it is a risk to address, and it warrants an entry in compatibility Annex C.
>
> I am currently searching the standard for the phrase “source character” and
> trying to make sense of the difference between “source character set” and
> “basic source character set”. The former seems to refer to some mythical
> thing that exists prior to conversion to UCNs, but applies to text being
> processed /after/ UCNification, where it is not clear that it makes a real
> distinction at that point.
>
> Good examples are the h-char and q-char sequences for header names.
> The current text just looks broken for header names outside the basic
> source character set, as the text we actually parse is post-UCNification,
> but it is also conditionally supported behavior to have a ‘\’ character in such
> a char-sequence, indicating that post-UCNified text is problematic.
>
> I believe this paper will be more than the light treatment you seem to expect,
> but it will shake out and fix a few dusty corners giving us a more robust spec
> as part of the process - and that would be another feature of the proposal
> that I could get behind!
>
> AlisdairM
>
> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
>
> Hi Alisdair,
>
> Thank you for the feedback; that's a very good suggestion. It ties into the suggested change to processing of UCNs that we've discussed a few times.
>
> When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
>
> Best regards,
>
> Peter
>
> -----Original Message-----
> From: Alisdair Meredith <alisdairm_at_[hidden]>
> Sent: 24 August 2020 17:29
> To: SG16 <sg16_at_[hidden]>
> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>
>
> Minor suggestion on the wording,
>
> You strike the mapping of non-basic source code characters to
> universal-character-name, including the cross-reference to such
> mappings reverting in raw string literals (5.4). I suggest making
> a matching edit to strike the reference in (5.4)p3 as well, so that
> the only thing reverted is line splicing in phase 2.
>
> That said, with these changes, I am curious what the difference
> is between a u8 string literal and a plain ‘char’ string literal, as
> the contents of that literal are now going to be Unicode source
> text (rather than requesting a mapping from source to Unicode
> of the literal’s contents)?
>
> AlisdairM
>
> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>
> Hi all,
>
> In this week's meeting, we are going to discuss the remaining
> proposals from P2178R1 "Misc lexing and string handling improvements".
> In particular, we will discuss proposal 9:
>
> Proposal 9: Reaffirming Unicode as the character set of the
> internal representation
>
> In anticipation of a lively discussion, Corentin and I have written a
> short new paper which will be appearing in the September mailing.
>
> P2194R0 The character set of C++ source code is Unicode
>
> https://isocpp.org/files/papers/P2194R0.pdf
>
>
> We hope that the study group finds this contribution helpful and
> informative.
>
> Best regards,
>
> Peter
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-08-24 16:04:08