Re: [SG16] During lexing, What constitute new lines and whitespaces ?

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 1 Jun 2020 15:04:07 +0000
Is it viable to defer to Unicode for the definitions of new-line and whitespace?

             Peter

From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Corentin via SG16
Sent: 01 June 2020 13:54
To: SG16 <sg16_at_lists.isocpp.org>
Cc: Corentin <corentin.jabot_at_[hidden]>
Subject: [SG16] During lexing, What constitute new lines and whitespaces ?

The standard doesn't specify what the new-line character is.
According to Unicode, the following code point sequences should be considered line terminators:

 LF: Line Feed, U+000A
 VT: Vertical Tab, U+000B
 FF: Form Feed, U+000C
 CR: Carriage Return, U+000D
 CR+LF: CR (U+000D) followed by LF (U+000A)
 NEL: Next Line, U+0085
 LS: Line Separator, U+2028
 PS: Paragraph Separator, U+2029
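
As a rough illustration of the list above, here is a minimal C++17 sketch that recognizes a line terminator at a given position in a sequence of code points. The function name `line_terminator_length` and the constant names are hypothetical and are not proposed wording; the only point is that CR+LF has to be matched as one two-code-point sequence before CR is considered on its own.

```cpp
#include <cstddef>
#include <string_view>

// Code points from the list above (constant names are illustrative only).
constexpr char32_t LF  = 0x000A;  // Line Feed
constexpr char32_t VT  = 0x000B;  // Vertical Tab
constexpr char32_t FF  = 0x000C;  // Form Feed
constexpr char32_t CR  = 0x000D;  // Carriage Return
constexpr char32_t NEL = 0x0085;  // Next Line
constexpr char32_t LS  = 0x2028;  // Line Separator
constexpr char32_t PS  = 0x2029;  // Paragraph Separator

// Returns the length, in code points, of the line terminator starting at
// `pos`, or 0 if no terminator starts there. CR followed by LF is treated
// as a single terminator of length 2.
constexpr std::size_t line_terminator_length(std::u32string_view text,
                                             std::size_t pos) {
    if (pos >= text.size()) return 0;
    switch (text[pos]) {
    case CR:
        return (pos + 1 < text.size() && text[pos + 1] == LF) ? 2 : 1;
    case LF: case VT: case FF: case NEL: case LS: case PS:
        return 1;
    default:
        return 0;
    }
}
```

For instance, calling it on `U"a\r\nb"` at position 1 would report a terminator of length 2, while a lone CR would report length 1.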

Similarly, the standard defines "white spaces" loosely as "blanks, horizontal and vertical tabs"; however, Unicode has more white-space characters: https://en.wikipedia.org/wiki/Whitespace_character

What I would like to do:

* Define new-line and white-space as grammar terms, each with an explicit list of code point sequences.
* In phase 2, replace every sequence that represents a line termination with Line Feed (reverted later for raw string literals). This would notably fix https://wg21.link/cwg1655
* It would also help to mandate that trailing whitespace is removed in phase 2.
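
To make the last two points concrete, a minimal C++17 sketch of such a normalization pass might look as follows. Everything here is an assumption for illustration: the names `normalize_lines`, `is_line_terminator` and `is_trailing_blank` are hypothetical, only space and horizontal tab are treated as trailing whitespace, and the later reversal for raw string literals is not modeled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

constexpr char32_t LF = 0x000A;
constexpr char32_t CR = 0x000D;

// Single code points that terminate a line, per the list earlier in this
// message (CR+LF is handled as a pair in normalize_lines).
constexpr bool is_line_terminator(char32_t c) {
    return (c >= 0x000A && c <= 0x000D)  // LF, VT, FF, CR
        || c == 0x0085                   // NEL
        || c == 0x2028 || c == 0x2029;   // LS, PS
}

// Assumption: only space and horizontal tab count as trailing whitespace.
constexpr bool is_trailing_blank(char32_t c) {
    return c == U' ' || c == U'\t';
}

// Replaces every line terminator with a single LF and drops blanks that
// immediately precede it. Blanks at the very end of the input, with no
// terminator after them, are left untouched.
std::u32string normalize_lines(std::u32string_view in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (!is_line_terminator(c)) {
            out.push_back(c);
            continue;
        }
        // Consume the LF of a CR+LF pair so the pair yields one new-line.
        if (c == CR && i + 1 < in.size() && in[i + 1] == LF) ++i;
        while (!out.empty() && is_trailing_blank(out.back())) out.pop_back();
        out.push_back(LF);
    }
    return out;
}
```

Note that in this sketch VT and FF end a line because they appear in the terminator list, so consecutive VT and FF each produce their own LF; whether that is desirable is part of the question being asked.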

Does that make sense to anyone?

Received on 2020-06-01 10:07:20