SG16

Subject: Re: During lexing, What constitute new lines and whitespaces ?
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-06-01 12:20:39


On Mon, 1 Jun 2020 at 19:14, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/1/20 1:01 PM, Tom Honermann via SG16 wrote:
>
> On 6/1/20 8:53 AM, Corentin via SG16 wrote:
>
>
> The standard doesn't specify what the new-line character is.
> According to Unicode, the following code point sequences should be
> considered line terminators:
>
> Could you please include a reference?
>
>
https://en.wikipedia.org/wiki/Newline#Unicode which is derived from
https://www.unicode.org/reports/tr14/tr14-32.html

>
> LF: Line Feed, U+000A
> VT: Vertical Tab, U+000B
> FF: Form Feed, U+000C
> CR: Carriage Return, U+000D
> CR+LF: CR (U+000D) followed by LF (U+000A)
> NEL: Next Line, U+0085
> LS: Line Separator, U+2028
> PS: Paragraph Separator, U+2029
>
> Similarly, the standard defines "white spaces" loosely as "blanks,
> horizontal and vertical tabs"; however, there are more white-space
> characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character
>
> What I would like to do:
>
> * Define new-line and white-space as grammar terms, each with an explicit
> list of code point sequences.
>
> I know the following doesn't fit in with your wording direction, but for
> conceptual clarity, in today's wording, you would be suggesting something
> like the following, correct?
>
> white-space:
> - space, horizontal tab, vertical tab, form feed, new-line
> - universal-character-name specifying U+000D (Carriage Return), U+0085
> (Next Line), U+2028 (Line Separator), U+2029 (Paragraph Separator)
>

Yep, but also any universal-character-name designating a character with the
Pattern_White_Space property (those would be easier to list explicitly)
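
For illustration only, the explicit list could look like the sketch below.
It assumes the Pattern_White_Space set as published in the Unicode
Character Database (11 code points); the function name is hypothetical and
this is not proposed wording.

```cpp
// Hypothetical helper, not proposed wording: returns true if the code point
// has the Unicode Pattern_White_Space property (assumed set of 11 code points).
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEL (Next Line)
        || c == 0x200E || c == 0x200F    // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
        || c == 0x2028 || c == 0x2029;   // LINE SEPARATOR, PARAGRAPH SEPARATOR
}
```

Note that this set contains every line terminator listed earlier, so a
separate new-line term would have to be carved out of it.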

> Tom.
>
> * In phase 2, replace all characters which represent a line termination
> with Line Feed (which is reverted later for raw string literals). This
> would notably fix https://wg21.link/cwg1655
> * It would also help to mandate that trailing whitespace is removed in
> phase 2
>
> Does that make sense to anyone?
>
> Without thinking too hard about it, this seems like a reasonable direction.
>
> I'm not fond of adding an additional case of reversion for raw string
> literals though.
>
> Tom.
>
>
>
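
To make the phase 2 idea above concrete, here is a minimal sketch of the
normalization step. It is an assumption-laden illustration, not proposed
wording: it works on already-decoded code points, treats only space and
horizontal tab as removable trailing whitespace, and does not model the
later reversion for raw string literals. All names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Single-code-point line terminators from the list in this thread
// (LF, VT, FF, CR, NEL, LS, PS). CR+LF is handled as a pair below.
bool is_line_terminator(char32_t c) {
    return (c >= 0x000A && c <= 0x000D)
        || c == 0x0085 || c == 0x2028 || c == 0x2029;
}

// Assumption: only space and horizontal tab count as trailing whitespace.
bool is_trailing_blank(char32_t c) {
    return c == 0x0020 || c == 0x0009;
}

// Replaces every line terminator with a single Line Feed and drops blanks
// that immediately precede it. Blanks at the very end of the input that are
// not followed by a terminator are left untouched.
std::u32string normalize_lines(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (!is_line_terminator(c)) {
            out.push_back(c);
            continue;
        }
        // Treat CR followed by LF as one terminator.
        if (c == 0x000D && i + 1 < in.size() && in[i + 1] == 0x000A) {
            ++i;
        }
        // Remove trailing blanks on the line that just ended.
        while (!out.empty() && is_trailing_blank(out.back())) {
            out.pop_back();
        }
        out.push_back(U'\n');
    }
    return out;
}
```

With these assumptions, the input `a`, space, CR, LF, `b`, U+2028, `c`, CR
becomes `a`, LF, `b`, LF, `c`, LF.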


