Subject: Re: P2194R0 The character set of C++ source code is Unicode
From: Alisdair Meredith (alisdairm_at_[hidden])
Date: 2020-08-24 11:40:14
I think I get it - "text" is represented in the execution character set,
whereas u8"text" is represented directly as UTF-8. In both cases
the abstract machine's internal representation will be UTF-8; we
are simply moving where the standard says some of the transcoding
occurs.
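As a rough illustration of that distinction (a sketch only, not taken from
P2194R0; the helper names code_units, utf8_e_acute, and ordinary_e_acute are
made up for this example), the code units of U+00E9 can be inspected in both
kinds of literal. The character is written as a universal-character-name so
the result does not depend on how the source file itself is encoded:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies the code units of a narrow or u8 string
// literal (excluding the terminating null) into unsigned byte values,
// so that char (C++17) and char8_t (C++20) literals can be compared alike.
template <typename CharT, std::size_t N>
std::vector<unsigned char> code_units(const CharT (&literal)[N]) {
    static_assert(sizeof(CharT) == 1, "intended for one-byte code units only");
    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        bytes.push_back(static_cast<unsigned char>(literal[i]));
    }
    return bytes;
}

// u8 literal: the literal encoding is UTF-8, so U+00E9 is always 0xC3 0xA9.
inline std::vector<unsigned char> utf8_e_acute() {
    return code_units(u8"\u00E9");
}

// Ordinary literal: the literal encoding is implementation-defined.
// Assumption: the implementation's encoding can represent U+00E9 at all.
// A UTF-8 execution character set yields 0xC3 0xA9; a Latin-1 or
// Windows-1252 execution character set would yield the single byte 0xE9.
inline std::vector<unsigned char> ordinary_e_acute() {
    return code_units("\u00E9");
}
```

Only the u8 result is portable. Comparing utf8_e_acute() with
ordinary_e_acute() on a given compiler shows whether its ordinary literal
encoding happens to be UTF-8.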
Also, as you cite ISO 10646:1993 for provenance, it might be worth
pointing out how regularly that standard has been updated, and our
current reference is to a document withdrawn by ISO in 2003, two
full decades before our planned C++23 publication!
Latest standard is 2017, with the current FDIS for 2020 out for review,
pending CD ballot: https://www.iso.org/standard/76835.html
> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
> Hi Alisdair,
> Thank you for the feedback; that's a very good suggestion. It ties into the suggested change to processing of UCNs that we've discussed a few times.
> When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
> Best regards,
>> -----Original Message-----
>> From: Alisdair Meredith <alisdairm_at_[hidden]>
>> Sent: 24 August 2020 17:29
>> To: SG16 <sg16_at_[hidden]>
>> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>> Minor suggestion on the wording,
>> You strike the mapping of non-basic source code characters to
>> universal-character-name, including the cross-reference to such
>> mappings reverting in raw string literals (5.4). I suggest making
>> a matching edit to strike the reference in (5.4)p3 as well, so that
>> the only thing reverted is line splicing in phase 2.
>> That said, with these changes, I am curious what the difference
>> is between a u8 string literal and a plain "char" string literal, as
>> the contents of that literal are now going to be Unicode source
>> text (rather than requesting a mapping from source to Unicode
>> of the literal's contents)?
>>> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>>> Hi all,
>>> In this week's meeting, we are going to discuss the remaining
>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>> In particular, we will discuss proposal 9:
>>> Proposal 9: Reaffirming Unicode as the character set of the
>>> internal representation
>>> In anticipation of a lively discussion, Corentin and I have written a
>>> short new paper which will be appearing in the September mailing.
>>> P2194R0 The character set of C++ source code is Unicode
>>> https://isocpp.org/files/papers/P2194R0.pdf
>>> We hope that the study group finds this contribution helpful and
>>> Best regards,
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg
SG16 list run by firstname.lastname@example.org