Subject: Re: P2194R0 The character set of C++ source code is Unicode
From: Alisdair Meredith (alisdairm_at_[hidden])
Date: 2020-08-24 14:44:29

Got another good corner case for you!

In the template form of user defined literals, the template parameter pack
is instantiated with characters corresponding to the source text, currently
mapping non-basic characters to UCNs, so that the template parser can
assume all characters are members of the basic source character set:

See [lex.ext] 5.13.8p3/4
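
For anyone not familiar with the template form, here is a minimal sketch of how
the pack is instantiated from the spelling of a literal. The suffixes _len and
_dec are hypothetical names invented for illustration, and _dec assumes the
literal contains only decimal digits; it does not show a UCN reaching the pack,
only that a parser written this way sees the literal one char at a time.

```cpp
#include <cstddef>

// Hypothetical suffix: reports how many characters the compiler
// placed in the template parameter pack for this literal.
template <char... Cs>
constexpr std::size_t operator""_len() {
    return sizeof...(Cs);
}

// Hypothetical suffix: a tiny decimal parser over the pack.
// Assumes every character is one of '0'..'9'.
template <char... Cs>
constexpr unsigned operator""_dec() {
    constexpr char chars[] = {Cs...};
    unsigned value = 0;
    for (char c : chars) {
        value = value * 10 + static_cast<unsigned>(c - '0');
    }
    return value;
}

// 123_len is treated as operator""_len<'1', '2', '3'>()
static_assert(123_len == 3, "pack holds one char per source character");
// The prefix characters '0' and 'x' are part of the pack as well
static_assert(0x1F_len == 4, "pack is the spelling, not the value");
static_assert(42_dec == 42, "digits parsed from the pack");
```

A parser like _dec inspects each char of the spelling directly, which is why
what the implementation puts in the pack for anything outside the basic source
character set matters to such code.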

By no longer mapping to UCNs, we break any UDL parsers that work with
UCNs today. I don’t know how many there are in production, possibly zero,
but it is a risk to address, and provide an entry in compatibility Annex C.

I am currently searching the standard for the phrase “source character” and
trying to make sense of the difference between “source character set” and
“basic source character set”. The former seems to refer to some mythical
thing that exists prior to conversion to UCNs, but applies to text being
processed /after/ UCNification, where it is not clear that it makes a real
distinction at that point.

Good examples are the h-char and q-char sequences for header names.
The current text just looks broken for header names outside the basic
source character set, as the text we actually parse is post-UCNification,
but it is also conditionally supported behavior to have a ‘\’ character in such
a char-sequence, indicating that post-UCNified text is problematic.

I believe this paper will be more than the light treatment you seem to expect,
but it will shake out and fix a few dusty corners giving us a more robust spec
as part of the process - and that would be another feature of the proposal
that I could get behind!


> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
> Hi Alisdair,
> Thank you for the feedback. That's a very good suggestion, thank you. It ties into the suggested change to processing of UCNs that we've discussed a few times.
> When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
> Best regards,
> Peter
>> -----Original Message-----
>> From: Alisdair Meredith <alisdairm_at_[hidden]>
>> Sent: 24 August 2020 17:29
>> To: SG16 <sg16_at_[hidden]>
>> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>> Minor suggestion on the wording,
>> You strike the mapping of non-basic source code characters to
>> universal-character-name, including the cross-reference to such
>> mappings reverting in raw string literals (5.4). I suggest making
>> a matching edit to strike the reference in (5.4)p3 as well, so that
>> the only thing reverted is line splicing in phase 2.
>> That said, with these changes, I am curious what the difference
>> is between a u8 string literal and a plain ‘char’ string literal, as
>> the contents of that literal are now going to be Unicode source
>> text (rather than requesting a mapping from source to Unicode
>> of the literal’s contents)?
>> AlisdairM
>>> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>>> Hi all,
>>> In this week's meeting, we are going to discuss the remaining
>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>> In particular, we will discuss proposal 9:
>>> Proposal 9: Reaffirming Unicode as the character set of the
>>> internal representation
>>> In anticipation of a lively discussion, Corentin and I have written a
>>> short new paper which will be appearing in the September mailing.
>>> P2194R0 The character set of C++ source code is Unicode
>>> https://isocpp.org/files/papers/P2194R0.pdf
>>> We hope that the study group finds this contribution helpful and
>>> informative.
>>> Best regards,
>>> Peter
