Subject: Re: During lexing, What constitute new lines and whitespaces ?
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-01 12:01:43

On 6/1/20 8:53 AM, Corentin via SG16 wrote:
> The standard doesn't specify what the new-line character is.
> According to Unicode, the following codepoint sequences should be
> considered line terminators:
Could you please include a reference?
>  LF:    Line Feed, U+000A
>  VT:    Vertical Tab, U+000B
>  FF:    Form Feed, U+000C
>  CR:    Carriage Return, U+000D
>  CR+LF: CR (U+000D) followed by LF (U+000A)
>  NEL:   Next Line, U+0085
>  LS:    Line Separator, U+2028
>  PS:    Paragraph Separator, U+2029
> Similarly, the standard defines "white spaces" loosely as "blanks,
> horizontal and vertical tabs", however there are more white space
> characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character
> What I would like to do:
> * Define new-line and white-space as grammar terms, with an explicit
> list of codepoint sequences.
> * In phase 2, replace all characters which represent a line
> termination with Line Feed (which is reverted later for raw string
> literals). This would notably fix https://wg21.link/cwg1655
> * It would also help to mandate that trailing whitespace is removed
> in phase 2.
> Does that make sense to anyone?

Without thinking too hard about it, this seems like a reasonable direction.

I'm not fond of adding an additional case of reversion for raw string
literals though.
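
To make the phase 2 replacement discussed above concrete, here is a
minimal sketch, assuming the source has already been decoded to UTF-32.
The function name normalize_line_terminators is hypothetical and comes
from neither the standard nor any proposal text. It maps each sequence
in the quoted list to a single Line Feed and does nothing about the raw
string literal reversion.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (illustration only): replaces every line terminator
// from the quoted list with a single LF (U+000A).
// Assumption: the input is already decoded to UTF-32 code points.
std::u32string normalize_line_terminators(std::u32string_view src) {
    constexpr char32_t LF  = 0x000A;  // Line Feed
    constexpr char32_t VT  = 0x000B;  // Vertical Tab
    constexpr char32_t FF  = 0x000C;  // Form Feed
    constexpr char32_t CR  = 0x000D;  // Carriage Return
    constexpr char32_t NEL = 0x0085;  // Next Line
    constexpr char32_t LS  = 0x2028;  // Line Separator
    constexpr char32_t PS  = 0x2029;  // Paragraph Separator

    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        switch (c) {
        case CR:
            // CR+LF counts as one terminator, so skip the LF that follows.
            if (i + 1 < src.size() && src[i + 1] == LF) {
                ++i;
            }
            out.push_back(LF);
            break;
        case VT:
        case FF:
        case NEL:
        case LS:
        case PS:
            out.push_back(LF);
            break;
        default:
            // LF itself and all other code points pass through unchanged.
            out.push_back(c);
            break;
        }
    }
    return out;
}
```

With this sketch, U"a\r\nb" and U"a\rb" both become U"a\nb". Treating
CR+LF as one unit is the only case that needs lookahead. Everything else
is a one-to-one replacement, which is also why undoing it later for raw
string literals would need the original sequences to be recorded
somewhere.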

