Hello,

As indicated in the telecon, here is a mail full of whitespaces and line breaks.

The issues with whitespaces and line breaks are multiple.

Wording:

- We are not consistent about the spelling of whitespace - Editorial PR https://github.com/cplusplus/draft/pull/4557

- We are not consistent about using "whitespace character" or just "whitespace". I believe the solution here would be to make whitespace a grammar term.

- We should use the Unicode name, in upper case, to spell the various whitespaces when they are mentioned

- new-line is sometimes a grammar term, sometimes not

I believe the solution for all of these issues is to introduce and use grammar terms for both new-line and whitespaces.

Unicode whitespaces and newlines

The list of new lines is as follows:

LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029

The list of additional whitespaces is as follows:

U+0009 HORIZONTAL TAB
U+0020 SPACE
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK

The whitespaces not supported by C++ are in bold.

That list poses some challenges for C++ and implementations.

These additional whitespaces are not in the Basic Latin block, which would require implementations to expect arbitrary Unicode in places where they currently might not.

I am not sure that the cost/benefit ratio justifies adding these characters.

Furthermore, I think it would be ill-advised to consider LRM and RLM in C++, as these change the directionality of text. That is sensible in multilingual prose, but it poses interesting challenges in C++, challenges which have already been discussed in the context of UAX31.

NEL is of course used by EBCDIC, but it could be mapped to LF in phase 1, as is recommended by UTF-EBCDIC.

As such, I do not think extending the set of new lines and whitespaces has much value.

New lines

There is, however, a catch there. There always is.

The mapping of a new-line character to any other new-line character is not observable, except for the purpose of raw string literals, which is the subject of CWG-1655:

http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
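
To make the observable case concrete, here is a minimal sketch (the names raw and answer are purely illustrative). The line break inside the raw string literal becomes part of the program's data, while the line break in the declaration below it only separates tokens. Under the current wording a source new-line in a raw string literal yields a new-line ('\n') in the resulting literal, which is what the comments assume.

```cpp
// The line break between 'a' and 'b' is part of the literal's value:
// the array holds 'a', '\n', 'b' and the terminating '\0'.
constexpr char raw[] = R"(a
b)";

// A line break outside a literal only separates tokens and leaves
// no trace in the program.
constexpr int
    answer = 42;
```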

I believe that, from the user's perspective, it is reasonable that raw strings use the line terminator appropriate for the target platform.

It's also in line with the ideas that non-visible characters should not impact the semantics of programs and that source code should be portable.

I believe the following mechanism would provide the desirable observable behavior:

1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces and comments are replaced by SPACE in phase 3)

2/ Define new-line to be an implementation-defined sequence of abstract characters representable in the literal and wide literal encodings (for the benefit of escape sequences, raw strings and chrono)

3/ In phase 5, before converting to the execution encoding, replace each LF by a new-line in raw string literals
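
As a rough illustration of steps 1/ and 3/, the sketch below works on text already transcoded to UTF-32. The function names normalize_new_lines and expand_new_lines and the constant NEL are hypothetical, chosen only for this example, and the new_line parameter stands in for whatever sequence an implementation would define in step 2/. It is not meant to describe how any existing compiler is structured.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// U+0085 NEXT LINE, written numerically to keep the source plain ASCII.
constexpr char32_t NEL = 0x0085;

// Step 1/ (sketch): replace each new-line sequence (CR, LF, NEL, CRLF)
// by a single LF. CR immediately followed by LF counts as one sequence.
std::u32string normalize_new_lines(std::u32string_view src) {
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == U'\r') {
            if (i + 1 < src.size() && src[i + 1] == U'\n') {
                ++i;  // consume the LF of a CRLF pair
            }
            out.push_back(U'\n');
        } else if (c == NEL) {
            out.push_back(U'\n');
        } else {
            out.push_back(c);  // LF and all other characters pass through
        }
    }
    return out;
}

// Step 3/ (sketch): inside a raw string literal, replace each LF by the
// implementation-defined new-line sequence, passed here as new_line.
std::u32string expand_new_lines(std::u32string_view raw_body,
                                std::u32string_view new_line) {
    std::u32string out;
    for (const char32_t c : raw_body) {
        if (c == U'\n') {
            out.append(new_line);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With this shape, a file saved with CRLF and the same file saved with LF give the same result after step 1/, and the target's choice of line terminator only shows up when step 3/ is applied to raw string literals.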

The good news is that we can improve all of that without going to EWG.

Corentin