On Mon, Jun 1, 2020, 17:04 Peter Brett <pbrett@cadence.com> wrote:

Is it viable to defer to Unicode for the definitions of new-line and whitespace?


We do it for UAX #31 in Steve's paper, so we might be able to.
But it might be worth being explicit about new lines: there is no category or block for them, and some new lines are more than one codepoint.

I think white space may be definable in terms of Pattern_White_Space + new lines.
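
A minimal sketch of that idea, assuming the Pattern_White_Space set from UAX #31 (U+0009..U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, U+2029). The function name is hypothetical and is not taken from any paper or from the standard:

```cpp
// Hypothetical helper: true if `c` has the Unicode Pattern_White_Space property.
// Assumption: the set is the 11 code points listed in UAX #31.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // HT, LF, VT, FF, CR
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // NEL
        || c == 0x200E || c == 0x200F     // LRM, RLM
        || c == 0x2028 || c == 0x2029;    // LS, PS
}
```

Note that this set already covers every single-codepoint line terminator, so the only new line that needs separate treatment is the two-codepoint CR+LF sequence. It does not cover other Unicode white space such as U+00A0 (no-break space).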

 

             Peter

 

From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Corentin via SG16
Sent: 01 June 2020 13:54
To: SG16 <sg16@lists.isocpp.org>
Cc: Corentin <corentin.jabot@gmail.com>
Subject: [SG16] During lexing, What constitute new lines and whitespaces ?

 

 

The standard doesn't specify what the new-line character is.

According to Unicode, the following codepoint sequences should be considered line terminators:

 

 LF:    Line Feed, U+000A
 VT:    Vertical Tab, U+000B
 FF:    Form Feed, U+000C
 CR:    Carriage Return, U+000D
 CR+LF: CR (U+000D) followed by LF (U+000A)
 NEL:   Next Line, U+0085
 LS:    Line Separator, U+2028
 PS:    Paragraph Separator, U+2029
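
As a rough illustration of why an explicit list of sequences matters, here is a minimal sketch that recognises the terminators above in a sequence of code points. The names are hypothetical; the only assumption beyond the list is that CR+LF is matched greedily as a single terminator:

```cpp
#include <cstddef>
#include <string_view>

constexpr char32_t LF = 0x000A, VT = 0x000B, FF = 0x000C, CR = 0x000D,
                   NEL = 0x0085, LS = 0x2028, PS = 0x2029;

// Hypothetical helper: number of code points forming a line terminator
// at position `pos` of `text`, or 0 if no terminator starts there.
constexpr std::size_t line_terminator_length(std::u32string_view text,
                                             std::size_t pos) {
    if (pos >= text.size()) return 0;
    switch (text[pos]) {
    case CR:
        // CR followed by LF counts as one two-codepoint terminator.
        return (pos + 1 < text.size() && text[pos + 1] == LF) ? 2 : 1;
    case LF: case VT: case FF: case NEL: case LS: case PS:
        return 1;
    default:
        return 0;
    }
}
```

With this, `line_terminator_length(U"a\r\nb", 1)` yields 2, while a lone CR or LF yields 1.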

 

Similarly, the standard defines "white space" loosely as "blanks, horizontal and vertical tabs"; however, Unicode has more white space characters: https://en.wikipedia.org/wiki/Whitespace_character

 

What I would like to do:

 

* Define new-line and white-space as grammar terms, with an explicit list of codepoint sequences.

* In phase 2, replace every codepoint sequence that represents a line termination with Line Feed (this is reverted later for raw string literals). This would notably fix https://wg21.link/cwg1655

* It would also help to mandate that trailing whitespace is removed in phase 2.
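
The two phase 2 steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not proposed wording: the function names are hypothetical, "trailing whitespace" is taken to mean only space (U+0020) and horizontal tab (U+0009), trailing whitespace at the end of input is also stripped, and the later reversal for raw string literals is not modelled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: length in code points of the line terminator
// starting at `pos` (which must be < s.size()), or 0 if there is none.
inline std::size_t terminator_length(std::u32string_view s, std::size_t pos) {
    const char32_t c = s[pos];
    if (c == 0x000D)  // CR, optionally followed by LF
        return (pos + 1 < s.size() && s[pos + 1] == 0x000A) ? 2 : 1;
    if ((c >= 0x000A && c <= 0x000C) ||  // LF, VT, FF
        c == 0x0085 || c == 0x2028 || c == 0x2029)  // NEL, LS, PS
        return 1;
    return 0;
}

// Replaces every line terminator with LF and drops trailing spaces and
// tabs on each line (assumption: only U+0020 and U+0009 are stripped).
inline std::u32string normalize_lines(std::u32string_view src) {
    std::u32string out;
    out.reserve(src.size());
    std::size_t line_start = 0;  // index in `out` where the current line begins
    auto strip_trailing = [&] {
        while (out.size() > line_start &&
               (out.back() == 0x0020 || out.back() == 0x0009))
            out.pop_back();
    };
    for (std::size_t i = 0; i < src.size();) {
        if (const std::size_t n = terminator_length(src, i)) {
            strip_trailing();
            out.push_back(0x000A);
            line_start = out.size();
            i += n;
        } else {
            out.push_back(src[i++]);
        }
    }
    strip_trailing();
    return out;
}
```

For example, `normalize_lines(U"a \r\nb\t\rc")` produces `U"a\nb\nc"`.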

 

Does that make sense to anyone?