sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Mon, 24 Aug 2020 16:13:17 -0400

On Mon, Aug 24, 2020 at 3:44 PM Alisdair Meredith via SG16 <
sg16_at_[hidden]> wrote:

> Got another good corner case for you!
>
> In the template form of user defined literals, the template parameter pack
> is instantiated with characters corresponding to the source text, currently
> mapping non-basic characters to UCNs, so that the template parser can
> assume all characters are members of the basic source character set:
>
> See [lex.ext] 5.13.8p3/4
>
> By no longer mapping to UCNs, we break any UDL parsers that work with
> UCNs today. I don’t know how many there are in production, possibly zero,
> but it is a risk to address, and provide an entry in compatibility Annex C.
>
> I am currently searching the standard for the phrase “source character” and
> trying to make sense of the difference between “source character set” and
> “basic source character set”. The former seems to refer to some mythical
> thing that exists prior to conversion to UCNs, but applies to text being
> processed /after/ UCNification, where it is not clear that it makes a real
> distinction at that point.
>
> Good examples are the h-char and q-char sequences for header names.
> The current text just looks broken for header names outside the basic
> source character set, as the text we actually parse is post-UCNification,
> but it is also conditionally supported behavior to have a ‘\’ character in
> such a char-sequence, indicating that post-UCNified text is problematic.
>
> I believe this paper will be more than the light treatment you seem to
> expect, but it will shake out and fix a few dusty corners, giving us a more
> robust spec as part of the process - and that would be another feature of
> the proposal that I could get behind!
>
The previous discussions were already leading towards a direction that
would address such corners through removing UCNification. I am not sure why
the decision to restrict the processing to characters representable in
Unicode is not considered a separable question.

>
> AlisdairM
>
> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
>
> Hi Alisdair,
>
> Thank you for the feedback. That's a very good suggestion, thank you. It
> ties into the suggested change to processing of UCNs that we've discussed a
> few times.
>
> When you have a u8"" literal, the associated literal encoding is UTF-8.
> When you have a 'plain' "" string literal, the associated literal encoding
> is implementation-defined.
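
A small sketch of that distinction, assuming a C++11 or later compiler. The names `utf8_size` and `plain_size` are made up for illustration, and U+00E9 is written as a universal-character-name so the example does not depend on the encoding of the source file.

```cpp
#include <cstddef>

// u8 literal: always UTF-8, so U+00E9 occupies two code units plus the
// terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\u00E9");

// Plain literal: the size depends on the implementation-defined ordinary
// literal encoding (for example 3 with UTF-8, 2 with Latin-1).
constexpr std::size_t plain_size = sizeof("\u00E9");

static_assert(utf8_size == 3, "UTF-8 encodes U+00E9 as two code units");
```

Only the u8 result is portable; the value of `plain_size` is deliberately left unchecked because it varies between implementations.
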
>
> Best regards,
>
> Peter
>
> -----Original Message-----
> From: Alisdair Meredith <alisdairm_at_[hidden]>
> Sent: 24 August 2020 17:29
> To: SG16 <sg16_at_[hidden]>
> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>
> Minor suggestion on the wording,
>
> You strike the mapping of non-basic source code characters to
> universal-character-name, including the cross-reference to such
> mappings reverting in raw string literals (5.4). I suggest making
> a matching edit to strike the reference in (5.4)p3 as well, so that
> the only thing reverted is line splicing in phase 2.
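
As a sketch of the reversion that would remain, the following compares a raw and an ordinary string literal that each contain a backslash immediately followed by a newline. The variable names are hypothetical, and the example assumes a compiler with C++11 raw string literals.

```cpp
#include <cstddef>

// Raw literal: line splicing is reverted, so the backslash and the
// newline both survive as characters of the literal.
constexpr char raw[] = R"(a\
b)";

// Ordinary literal: phase 2 splices the two lines, so the pair vanishes.
constexpr char cooked[] = "a\
b";

constexpr std::size_t raw_size = sizeof(raw);       // a, backslash, newline, b, null
constexpr std::size_t cooked_size = sizeof(cooked); // a, b, null
```
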
>
> That said, with these changes, I am curious what the difference
> is between a u8 string literal and a plain ‘char’ string literal, as
> the contents of that literal are now going to be Unicode source
> text (rather than requesting a mapping from source to Unicode
> of the literal’s contents)?
>
> AlisdairM
>
> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>
> Hi all,
>
> In this week's meeting, we are going to discuss the remaining
> proposals from P2178R1 "Misc lexing and string handling improvements".
> In particular, we will discuss proposal 9:
>
> Proposal 9: Reaffirming Unicode as the character set of the
> internal representation
>
> In anticipation of a lively discussion, Corentin and I have written a
> short new paper which will be appearing in the September mailing.
>
> P2194R0 The character set of C++ source code is Unicode
>
> https://isocpp.org/files/papers/P2194R0.pdf
>
>
> We hope that the study group finds this contribution helpful and
> informative.
>
> Best regards,
>
> Peter
>
> --
> SG16 mailing list
> SG16_at_[hidden]
>
>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2020-08-24 15:17:05