Subject: Re: During lexing, What constitute new lines and whitespaces ?
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-01 12:14:06
On 6/1/20 1:01 PM, Tom Honermann via SG16 wrote:
> On 6/1/20 8:53 AM, Corentin via SG16 wrote:
>> The standard doesn't specify what the new-line character is.
>> According to Unicode, the following codepoint sequences should be
>> considered line terminators:
> Could you please include a reference?
>>   LF:    Line Feed, U+000A
>>   VT:    Vertical Tab, U+000B
>>   FF:    Form Feed, U+000C
>>   CR:    Carriage Return, U+000D
>>   CR+LF: CR (U+000D) followed by LF (U+000A)
>>   NEL:   Next Line, U+0085
>>   LS:    Line Separator, U+2028
>>   PS:    Paragraph Separator, U+2029
>> Similarly, the standard defines "white spaces" loosely as "blanks,
>> horizontal and vertical tabs", however there are more white space
>> characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character
>> What I would like to do:
>> * Define new-line and white-space as grammar terms, with an explicit
>> list of codepoint sequences.
I know the following doesn't fit in with your wording direction, but for
conceptual clarity, in today's wording, you would be suggesting
something like the following, correct?
- space, horizontal tab, vertical tab, form feed, new-line
- universal-character-name specifying U+000D (Carriage Return), U+0085
(Next Line), U+2028 (Line Separator), U+2029 (Paragraph Separator)
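To make that list concrete, here is a minimal sketch of a predicate over code points. The name is_listed_whitespace is hypothetical, and the set simply mirrors the two bullets above; it is not wording from the standard.

```cpp
// Hypothetical helper (not from the standard): true if `cp` is one of the
// code points in the list above.
constexpr bool is_listed_whitespace(char32_t cp) {
    switch (cp) {
        case 0x0020:  // space
        case 0x0009:  // horizontal tab
        case 0x000B:  // vertical tab
        case 0x000C:  // form feed
        case 0x000A:  // new-line (Line Feed)
        case 0x000D:  // Carriage Return
        case 0x0085:  // Next Line
        case 0x2028:  // Line Separator
        case 0x2029:  // Paragraph Separator
            return true;
        default:
            return false;
    }
}
```

With this, is_listed_whitespace(0x2028) is true and is_listed_whitespace(U'a') is false.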
>> * In phase 2, replace all characters which represent a line
>> termination with Line Feed (which is reverted later for raw string
>> literals). This would notably fix https://wg21.link/cwg1655
>> * It would also help to mandate that trailing whitespace is removed
>> in phase 2
>> Does that make sense to anyone?
> Without thinking too hard about it, this seems like a reasonable
> I'm not fond of adding an additional case of reversion for raw string
> literals though.
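
For illustration only, the phase 2 replacement discussed above might look roughly like the following. This is a sketch under stated assumptions, not proposed wording: the names is_line_terminator and normalize_phase2 are hypothetical, the input is assumed to be already decoded to UTF-32, "trailing whitespace" is assumed to mean only space and horizontal tab, and the reversion for raw string literals is not modeled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the single code points from the quoted list of
// line terminators (CR+LF is handled as a sequence below).
constexpr bool is_line_terminator(char32_t cp) {
    return cp == 0x000A || cp == 0x000B || cp == 0x000C || cp == 0x000D ||
           cp == 0x0085 || cp == 0x2028 || cp == 0x2029;
}

// Hypothetical sketch: replace every line terminator with Line Feed and
// drop spaces and horizontal tabs that immediately precede it.
std::u32string normalize_phase2(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t cp = in[i];
        if (!is_line_terminator(cp)) {
            out.push_back(cp);
            continue;
        }
        // CR followed by LF counts as one terminator, not two.
        if (cp == 0x000D && i + 1 < in.size() && in[i + 1] == 0x000A) {
            ++i;
        }
        // Assumption: trailing whitespace means space and horizontal tab.
        while (!out.empty() && (out.back() == U' ' || out.back() == U'\t')) {
            out.pop_back();
        }
        out.push_back(U'\n');
    }
    return out;
}
```

Treating VT and FF as line terminators here follows the quoted list; an implementation that keeps them as plain whitespace would drop those two cases from is_line_terminator.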