On Mon, Jun 1, 2020, 17:04 Peter Brett <pbrett@cadence.com> wrote:

Is it viable to defer to Unicode for the definitions of new-line and whitespace?

We do it for UAX #31 in Steve's paper, so we might be able to.

But it might be worth being explicit about new lines: there is no category or block for them, and some new lines are more than one codepoint.

I think white space may be definable in terms of Pattern_White_Space + new lines.

Peter

From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Corentin via SG16
Sent: 01 June 2020 13:54
To: SG16 <sg16@lists.isocpp.org>
Cc: Corentin <corentin.jabot@gmail.com>
Subject: [SG16] During lexing, What constitute new lines and whitespaces ?

The standard doesn't specify what the new-line character is.

According to Unicode, the following codepoint sequences should be considered line terminators:

LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
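
As a rough sketch of how a lexer could recognise the sequences listed above, assuming the source has already been decoded to code points (C++17). The function name `line_terminator_length` is hypothetical and not taken from any paper or implementation; the only non-obvious part is that CR+LF has to be matched as a single terminator.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns how many code points form a line terminator
// starting at `pos` in `text`, or 0 if no terminator starts there.
// The set of terminators is exactly the list given in this mail.
inline std::size_t line_terminator_length(std::u32string_view text,
                                          std::size_t pos) {
    if (pos >= text.size()) return 0;
    switch (static_cast<unsigned long>(text[pos])) {
    case 0x000D: // CR, or CR+LF when immediately followed by LF
        return (pos + 1 < text.size() && text[pos + 1] == 0x000A) ? 2 : 1;
    case 0x000A: // LF
    case 0x000B: // VT
    case 0x000C: // FF
    case 0x0085: // NEL
    case 0x2028: // LS
    case 0x2029: // PS
        return 1;
    default:
        return 0;
    }
}
```

With this, `line_terminator_length(U"a\r\nb", 1)` yields 2, so CR+LF counts as one new-line rather than two.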

Similarly, the standard defines "white spaces" loosely as "blanks, horizontal and vertical tabs"; however, there are more white space characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character

What I would like to do:

* Define new-line and white-space as grammar terms, with an explicit list of codepoint sequences.

* In phase 2, replace all characters that represent a line termination with Line Feed (which is reverted later for raw string literals). This would notably fix https://wg21.link/cwg1655

* It would also help to mandate that trailing whitespace is removed in phase 2.
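
To make the last two points concrete, here is a minimal sketch of such a normalisation pass in C++17. Everything in it is an assumption for illustration: the names `newline_length`, `is_blank` and `normalize_phase2` are made up, "trailing whitespace" is taken to mean only space and horizontal tab, and every terminator listed earlier (including VT and FF) is mapped to LF. It does not attempt the later reversal for raw string literals.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical: length in code points of a line terminator at `pos`, else 0.
inline std::size_t newline_length(std::u32string_view s, std::size_t pos) {
    const char32_t c = s[pos];
    if (c == 0x000D) // CR alone, or CR+LF as one terminator
        return (pos + 1 < s.size() && s[pos + 1] == 0x000A) ? 2 : 1;
    const bool single = c == 0x000A || c == 0x000B || c == 0x000C ||
                        c == 0x0085 || c == 0x2028 || c == 0x2029;
    return single ? 1 : 0;
}

// Assumption: only space and horizontal tab are stripped at end of line.
inline bool is_blank(char32_t c) { return c == 0x0020 || c == 0x0009; }

// Replaces every line terminator with LF and drops blanks that directly
// precede a terminator or the end of the input.
inline std::u32string normalize_phase2(std::u32string_view in) {
    std::u32string out;
    out.reserve(in.size());
    auto strip_trailing_blanks = [&out] {
        while (!out.empty() && is_blank(out.back())) out.pop_back();
    };
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t n = newline_length(in, i);
        if (n == 0) {
            out.push_back(in[i]); // ordinary code point, copied unchanged
            ++i;
        } else {
            strip_trailing_blanks();
            out.push_back(char32_t{0x000A}); // every terminator becomes LF
            i += n;
        }
    }
    strip_trailing_blanks(); // last line may lack a terminator
    return out;
}
```

For example, `normalize_phase2(U"int a;  \r\nint b;\t\u2028x ")` gives `U"int a;\nint b;\nx"`.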

Does that make sense to anyone?