Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Alisdair Meredith <alisdairm_at_[hidden]>
Date: Mon, 24 Aug 2020 12:54:15 -0400
Looking into the wording more deeply, you are going to have to address
the notion of characters in identifiers, if you are no longer mapping into
UCNs, so we can no longer rely on non-digit handling that mapping for

This is probably best addressed in terms of
   http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r5.html
rather than trying for a different fix that will roll into that later.

Note that I was looking into a much more limited requirement to merely
require support for UTF-8 encoded source files as a well-specified
mapping, rather than switching to Unicode as the replacement for the
basic source character set. Once P1945 is adopted, I will have an
easier time buying into this paper, but in isolation it raises too many
tricky questions for me about Unicode text in surprising places. I
suspect we will need a deeper review of the grammar around identifiers
and other tokens to flush all the issues out - which is largely what P1945


> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
> Hi Alisdair,
> Thank you for the feedback. That's a very good suggestion, thank you. It ties into the suggested change to processing of UCNs that we've discussed a few times.
> When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
> Best regards,
> Peter
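
The distinction drawn above between u8"" and plain "" literals can be observed from code. The following is a minimal sketch, not taken from P2194R0; the names utf8_units, utf8_unit and ordinary_units are made up for illustration. It spells U+00E9 as a universal-character-name so the result does not depend on how the file itself is encoded.

```cpp
#include <cassert>
#include <cstddef>

// u8 literal: the associated encoding is always UTF-8, so U+00E9 is stored
// as the two code units 0xC3 0xA9 (plus the terminating null).
constexpr std::size_t utf8_units = sizeof(u8"\u00E9") - 1;
static_assert(utf8_units == 2, "U+00E9 occupies two UTF-8 code units");

// Returns the i-th code unit of the u8 literal as an unsigned value.
// The cast works whether the element type is char (C++17) or char8_t (C++20).
constexpr unsigned utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}

// Ordinary literal: the encoding is implementation-defined, so the number of
// code units is deliberately not asserted here. A compiler using UTF-8 as its
// ordinary literal encoding would yield 2; one using Latin-1 would yield 1.
constexpr std::size_t ordinary_units = sizeof("\u00E9") - 1;
```

Only the u8 results are portable. The value of ordinary_units depends on the compiler and its options, which is the point of the example.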
>> -----Original Message-----
>> From: Alisdair Meredith <alisdairm_at_[hidden]>
>> Sent: 24 August 2020 17:29
>> To: SG16 <sg16_at_[hidden]>
>> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>> Minor suggestion on the wording,
>> You strike the mapping of non-basic source code characters to
>> universal-character-name, including the cross-reference to such
>> mappings reverting in raw string literals (5.4). I suggest making
>> a matching edit to strike the reference in (5.4)p3 as well, so that
>> the only thing reverted is line splicing in phase 2.
>> That said, with these changes, I am curious what the difference
>> is between a u8 string literal and a plain ‘char’ string literal, as
>> the contents of that literal are now going to be Unicode source
>> text (rather than requesting a mapping from source to Unicode
>> of the literal’s contents)?
>> AlisdairM
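
The reversion in raw string literals mentioned above has a visible effect that can be checked at compile time. The following is a small sketch of that existing behaviour, not of the proposed wording; the names raw_len, cooked_len, raw_spliced and cooked_spliced are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Inside a raw string literal, the spelling \u00E9 is not treated as a
// universal-character-name: it stays as the six characters \ u 0 0 E 9.
constexpr std::size_t raw_len = sizeof(R"(\u00E9)") - 1;
static_assert(raw_len == 6, "raw literal keeps the UCN spelling verbatim");

// In a u8 literal the same spelling names one character, U+00E9,
// which is encoded as two UTF-8 code units.
constexpr std::size_t cooked_len = sizeof(u8"\u00E9") - 1;
static_assert(cooked_len == 2, "UCN is interpreted outside raw literals");

// Line splicing (phase 2) is likewise reverted inside a raw string literal:
// the backslash and the newline both remain part of the contents.
constexpr char raw_spliced[] = R"(a\
b)";

// Outside a raw literal the backslash-newline pair is removed by splicing.
constexpr char cooked_spliced[] = "a\
b";
```

Here raw_spliced holds four characters ('a', backslash, newline, 'b') while cooked_spliced holds only "ab".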
>>> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>>> Hi all,
>>> In this week's meeting, we are going to discuss the remaining
>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>> In particular, we will discuss proposal 9:
>>> Proposal 9: Reaffirming Unicode as the character set of the
>>> internal representation
>>> In anticipation of a lively discussion, Corentin and I have written a
>>> short new paper which will be appearing in the September mailing.
>>> P2194R0 The character set of C++ source code is Unicode
>>> https://isocpp.org/files/papers/P2194R0.pdf
>>> We hope that the study group finds this contribution helpful and
>>> informative.
>>> Best regards,
>>> Peter
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-08-24 11:57:44