sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Mon, 24 Aug 2020 22:23:36 +0200

On 24/08/2020 21.44, Alisdair Meredith via SG16 wrote:
> Got another good corner case for you!
>
> In the template form of user defined literals, the template parameter pack
> is instantiated with characters corresponding to the source text, currently
> mapping non-basic characters to UCNs, so that the template parser can
> assume all characters are members of the basic source character set:
>
> See [lex.ext] 5.13.8p3/4
>
> By no longer mapping to UCNs, we break any UDL parsers that work with
> UCNs today. I don’t know how many there are in production, possibly zero,
> but it is a risk to address, and provide an entry in compatibility Annex C.

UCNs may only be introduced for characters not in the basic source
character set. Could you please point out which of the characters allowed
in a user-defined-integer-literal or user-defined-floating-point-literal
are not in the basic source character set?
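
To make the question concrete, here is a minimal sketch of a literal operator template that exposes the spelling it receives. The names `_spelling` and `only_numeric_literal_chars` are hypothetical and exist only for illustration; the check assumes an ASCII-compatible execution character set.

```cpp
#include <string>

// Hypothetical literal operator template: the compiler passes the spelling of
// the numeric literal (without the ud-suffix) as a pack of char values.
template <char... Cs>
std::string operator""_spelling() {
    const char buf[] = {Cs..., '\0'};
    return std::string(buf);
}

// Hypothetical check: true if every character is a letter, a digit, or one of
// the punctuators that can occur in an integer or floating-point literal.
// Assumes ASCII-style contiguous letter ranges.
inline bool only_numeric_literal_chars(const std::string& s) {
    for (char c : s) {
        const bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
                        (c >= 'A' && c <= 'Z') || c == '.' || c == '+' ||
                        c == '-' || c == '\'';
        if (!ok) return false;
    }
    return true;
}
```

With this sketch, `0x2A_spelling` and `1.5e+3_spelling` yield the literal's spelling character by character, and every character seen is in the basic source character set.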

> I am currently searching the standard for the phrase “source character” and
> trying to make sense of the difference between “source character set” and
> “basic source character set”. The former seems to refer to some mythical
> thing that exists prior to conversion to UCNs, but applies to text being
> processed /after/ UCNification, where it is not clear that it makes a real
> distinction at that point.
>
> Good examples are the h-char and q-char sequences for header names.
> The current text just looks broken for header names outside the basic
> source character set, as the text we actually parse is post-UCNification,
> but it is also conditionally supported behavior to have a ‘\’ character in such
> a char-sequence, indicating that post-UCNified text is problematic.
>
> I believe this paper will be more than the light treatment you seem to expect,
> but it will shake out and fix a few dusty corners giving us a more robust spec
> as part of the process - and that would be another feature of the proposal
> that I could get behind!

Yes. I'd prefer if we could separate the "everything is Unicode" aspect
of the paper from any change to UCN-ification.

For the latter, see https://wiki.edg.com/pub/Wg21summer2020/SG16/charset.html
(early draft).

As a general remark, note that a UCN can represent a value that has no Unicode
character assigned. It is well-defined what we do with such UCNs in a u8
string literal, for instance. As I understand it, P2194R0 wants to
restrict C++ to the Unicode character set, an ever-expanding subset of
UCS scalar values. I'm opposed to that; I think we should support the
full range of UCS scalar values (i.e. values 0x0 - 0x10ffff excluding
surrogate code points) so that the next revision of Unicode doesn't
invalidate slightly older compilers.
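
As a minimal sketch of the range described above, the predicate below accepts exactly the UCS scalar values. The name `is_ucs_scalar_value` is hypothetical. The second `static_assert` relies on U+0378 having no character assigned, which is an assumption that holds as far as I know but could change in a later Unicode revision.

```cpp
#include <cstdint>

// Hypothetical predicate: a UCS scalar value is any code point in
// 0x0 - 0x10FFFF except the surrogate range 0xD800 - 0xDFFF.
constexpr bool is_ucs_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

// The predicate does not consult any Unicode version's assignment tables,
// so a value with no character assigned is still accepted.
static_assert(is_ucs_scalar_value(0x0378), "unassigned values are in range");

// A UCN naming such a value is still encoded in a u8 string literal:
// two UTF-8 code units plus the terminating null.
static_assert(sizeof(u8"\u0378") == 3, "U+0378 encodes as two UTF-8 code units");
```

Because the check depends only on the numeric range, a compiler built against an older Unicode revision would not reject source text that uses newly assigned characters.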

A separate question is whether there is use for items beyond those
representable by UCS scalar values (avoiding the loaded term
"character" here).

Jens

> AlisdairM
>
>> On Aug 24, 2020, at 12:32, Peter Brett <pbrett_at_[hidden]> wrote:
>>
>> Hi Alisdair,
>>
>> Thank you for the feedback; that's a very good suggestion. It ties into the suggested change to processing of UCNs that we've discussed a few times.
>>
>> When you have a u8"" literal, the associated literal encoding is UTF-8. When you have a 'plain' "" string literal, the associated literal encoding is implementation-defined.
>>
>> Best regards,
>>
>> Peter
>>
>>> -----Original Message-----
>>> From: Alisdair Meredith <alisdairm_at_[hidden]>
>>> Sent: 24 August 2020 17:29
>>> To: SG16 <sg16_at_[hidden]>
>>> Cc: Peter Brett <pbrett_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>
>>> Subject: Re: [SG16] P2194R0 The character set of C++ source code is Unicode
>>>
>>> Minor suggestion on the wording,
>>>
>>> You strike the mapping of non-basic source code characters to
>>> universal-character-name, including the cross-reference to such
>>> mappings reverting in raw string literals (5.4). I suggest making
>>> a matching edit to strike the reference in (5.4)p3 as well, so that
>>> the only thing reverted is line splicing in phase 2.
>>>
>>> That said, with these changes, I am curious what the difference
>>> is between a u8 string literal and a plain ‘char’ string literal, as
>>> the contents of that literal are now going to be Unicode source
>>> text (rather than requesting a mapping from source to Unicode
>>> of literal’s contents)?
>>>
>>> AlisdairM
>>>
>>>> On Aug 24, 2020, at 08:31, Peter Brett via SG16 <sg16_at_[hidden]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In this week's meeting, we are going to discuss the remaining
>>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>>> In particular, we will discuss proposal 9:
>>>>
>>>> Proposal 9: Reaffirming Unicode as the character set of the
>>>> internal representation
>>>>
>>>> In anticipation of a lively discussion, Corentin and I have written a
>>>> short new paper which will be appearing in the September mailing.
>>>>
>>>> P2194R0 The character set of C++ source code is Unicode
>>>>
>>>> https://isocpp.org/files/papers/P2194R0.pdf
>>>>
>>>> We hope that the study group finds this contribution helpful and
>>>> informative.
>>>>
>>>> Best regards,
>>>>
>>>> Peter
>>>>
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>>
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2020-08-24 15:27:15