On 6/1/20 1:01 PM, Tom Honermann via SG16 wrote:
On 6/1/20 8:53 AM, Corentin via SG16 wrote:

The standard doesn't specify what the new-line character is.
According to Unicode, the following code point sequences should be considered line terminators:
Could you please include a reference?

 LF:    Line Feed, U+000A
 VT:    Vertical Tab, U+000B
 FF:    Form Feed, U+000C
 CR:    Carriage Return, U+000D
 CR+LF: CR (U+000D) followed by LF (U+000A)
 NEL:   Next Line, U+0085
 LS:    Line Separator, U+2028
 PS:    Paragraph Separator, U+2029
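
For illustration only, the list above could be matched against a sequence of code points roughly as follows. This is a minimal sketch, not proposed wording: the function name `line_terminator_length` and the choice of `std::u32string_view` as input are assumptions made for the example, and it presumes the source has already been decoded to UTF-32.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from the standard or any proposal):
// returns the number of code points occupied by the line terminator
// that starts at text[pos], or 0 if no terminator starts there.
inline std::size_t line_terminator_length(std::u32string_view text,
                                          std::size_t pos) {
    if (pos >= text.size()) return 0;
    switch (text[pos]) {
    case 0x000D:  // CR, or the first half of CR+LF
        return (pos + 1 < text.size() && text[pos + 1] == 0x000A) ? 2 : 1;
    case 0x000A:  // LF
    case 0x000B:  // VT
    case 0x000C:  // FF
    case 0x0085:  // NEL
    case 0x2028:  // LS
    case 0x2029:  // PS
        return 1;
    default:
        return 0;
    }
}
```

The only multi-code-point entry is CR+LF, which is why the function returns a length rather than a boolean: a caller can skip the whole sequence and count it as a single line break.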


Similarly, the standard defines "white spaces" loosely as "blanks, horizontal and vertical tabs"; however, there are more white space characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character

What I would like to do:

* Define new-line and white-space as grammar terms, each with an explicit list of code point sequences.

I know the following doesn't fit in with your wording direction, but for conceptual clarity: in today's wording, you would be suggesting something like the following, correct?

white-space:
- space, horizontal tab, vertical tab, form feed, new-line
- universal-character-name specifying U+000D (Carriage Return), U+0085 (Next Line), U+2028 (Line Separator), U+2029 (Paragraph Separator)

Tom.

* In phase 2, replace every code point sequence that represents a line termination with Line Feed (which is reverted later for raw string literals). This would notably fix https://wg21.link/cwg1655
* It would also help to mandate that trailing whitespace is removed in phase 2.
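
As a rough illustration of the two phase 2 steps above, here is a minimal sketch under stated assumptions. The function `normalize_phase2` and its helpers are hypothetical names; the input is assumed to be already decoded to UTF-32; "trailing whitespace" is assumed to mean only space and horizontal tab; and the later reversion for raw string literals is not modeled at all.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: true for every single code point in the
// line-terminator list quoted earlier in this thread.
inline bool is_terminator_code_point(char32_t c) {
    return c == 0x000A || c == 0x000B || c == 0x000C || c == 0x000D ||
           c == 0x0085 || c == 0x2028 || c == 0x2029;
}

// Assumption: only space and horizontal tab count as trailing whitespace.
inline bool is_trailing_blank(char32_t c) {
    return c == 0x0020 || c == 0x0009;
}

// Hypothetical sketch of the suggested phase 2 behavior: each line
// terminator becomes a single LF, and blanks before it are dropped.
inline std::u32string normalize_phase2(std::u32string_view in) {
    constexpr char32_t LF = 0x000A;
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (!is_terminator_code_point(c)) {
            out.push_back(c);
            continue;
        }
        // CR immediately followed by LF is one terminator, not two.
        if (c == 0x000D && i + 1 < in.size() && in[i + 1] == LF) ++i;
        // Remove trailing blanks on the line that just ended.
        while (!out.empty() && is_trailing_blank(out.back())) out.pop_back();
        out.push_back(LF);
    }
    // Also strip blanks on a final line that has no terminator.
    while (!out.empty() && is_trailing_blank(out.back())) out.pop_back();
    return out;
}
```

For example, `normalize_phase2(U"a \r\nb\t\u2028c")` yields `U"a\nb\nc"`. Because the original terminators and blanks are discarded here, an implementation that must revert the change for raw string literals would need to keep the original spelling around, which this sketch does not attempt.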

Does that make sense to anyone?

Without thinking too hard about it, this seems like a reasonable direction.

I'm not fond of adding an additional case of reversion for raw string literals though.

Tom.