Date: Mon, 23 May 2022 09:01:02 +0200
On 22/05/2022 23.40, Hubert Tong wrote:
> On Sun, May 22, 2022 at 5:11 PM Hubert Tong
> <hubert.reinterpretcast_at_[hidden]> wrote:
>>
>> On Sun, May 22, 2022 at 2:24 AM Jens Maurer via SG16
>> <sg16_at_[hidden]> wrote:
>>> It seems we need to check that the C++ understanding of whitespace is
>>> a subset of Pattern_White_Space.
>>
>> It is a subset (exactly the subset within U+0000 to U+007F, inclusive;
>> except U+000D CARRIAGE RETURN is not a member of the basic character
>> set(!)).
>
> Once we have P2348, adding U+000D CARRIAGE RETURN to the basic
> character set should actually be editorial.
I disagree with P2348's idea of carrying various spellings of
new-line through. Even in raw string literals, new-lines are
harmonized to "\n", always, so there is no point in preserving
the spelling of new-line during lexing. We should keep the
existing phrasing that phase 1 replaces end-of-line indicators
(which might be CR-LF) with new-line, which we should clarify to
be U+000A LINE FEED.
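A minimal sketch of that phase 1 replacement, for illustration only. The helper name normalize_eol is hypothetical, and the set of end-of-line indicators (CR-LF, lone CR, lone LF) is an assumption; the mapping actually used is up to the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not wording from any paper: replaces every
// end-of-line indicator with a single U+000A LINE FEED.
// Assumption: the indicators are CR-LF, lone CR, and lone LF.
std::string normalize_eol(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\r') {
            // Treat CR-LF as one indicator: skip the LF that follows.
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                ++i;
            }
            out.push_back('\n');
        } else {
            // LF and all other characters pass through unchanged.
            out.push_back(src[i]);
        }
    }
    return out;
}
```

After such a pass, later phases (including raw string literal handling) only ever see LF, so the original spelling of the line ending is not observable.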
> Nothing in the wording as it is now precludes the grammar from having
> characters outside the basic character set be significant to parsing.
Agreed.
> \u000d is already ill-formed outside of string and character literals,
> and it is already in the basic literal character set.
Well, in the status quo, U+000D CARRIAGE RETURN could be part of
"new-line", which certainly appears outside of literals.
I agree that the UCN \u000d is ill-formed outside of literals,
though.
>> Also, whether the requirement means that a language has exactly
>> Pattern_White_Space as its definition for whitespace is in question.
Good question. That means WG21 probably shouldn't act on this
until the promised editorial rewrite of UAX31 is available.
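To make the subset-versus-exact-match distinction concrete, here is a rough sketch. The Pattern_White_Space list is the one from the Unicode property data; the C++ whitespace list (space, horizontal tab, vertical tab, form feed, and new-line taken as LF) and all names are assumptions made for illustration.

```cpp
#include <array>

// Pattern_White_Space code points per the Unicode property data.
constexpr std::array<char32_t, 11> pattern_white_space{
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029};

// Assumed C++ whitespace: HT, LF (new-line), VT, FF, space.
// U+000D CARRIAGE RETURN is deliberately absent.
constexpr std::array<char32_t, 5> cxx_whitespace{
    0x0009, 0x000A, 0x000B, 0x000C, 0x0020};

constexpr bool is_pattern_white_space(char32_t c) {
    for (char32_t p : pattern_white_space) {
        if (p == c) return true;
    }
    return false;
}

// True if every assumed C++ whitespace character is Pattern_White_Space.
constexpr bool cxx_whitespace_is_subset() {
    for (char32_t c : cxx_whitespace) {
        if (!is_pattern_white_space(c)) return false;
    }
    return true;
}

// The subset relation holds, but the two sets are not equal.
static_assert(cxx_whitespace_is_subset());
static_assert(cxx_whitespace.size() < pattern_white_space.size());
```

Under these assumptions the subset reading of the requirement is satisfied, while the "exactly Pattern_White_Space" reading is not.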
Jens
Received on 2022-05-23 07:01:08