Hello,
As indicated in the telecon, here is a mail full of whitespaces and line breaks.
There are multiple issues with whitespace and line breaks.
Wording:
- We are not consistent about using "whitespace character" or just "whitespace". I believe the solution here would be to make whitespace a grammar term.
- We should use the Unicode name, in upper case, to spell out the various whitespace characters when they are mentioned.
- new-line is sometimes a grammar term, sometimes not.
I believe the solution for all of these issues is to introduce and use grammar terms for both new-line and whitespace.
Unicode whitespaces and newlines
The list of new lines is as follows:
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
The list of additional whitespaces is as follows:
U+0009 HORIZONTAL TAB
U+0020 SPACE
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
The whitespaces not supported by C++ are in bold.
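For reference, the two lists above can be written down as a pair of predicates. This is only a sketch to make the code points concrete; the function names are made up for illustration and are not proposed wording.

```cpp
// Hypothetical helpers: the names are illustrative only.

// True for the single-code-point new lines listed above.
// CR+LF is a two-code-point sequence and needs separate handling.
constexpr bool is_listed_new_line(char32_t c) {
    return (c >= 0x000A && c <= 0x000D)  // LF, VT, FF, CR
        || c == 0x0085                    // NEL
        || c == 0x2028 || c == 0x2029;    // LS, PS
}

// True for the additional whitespaces listed above.
constexpr bool is_listed_additional_whitespace(char32_t c) {
    return c == 0x0009 || c == 0x0020    // HORIZONTAL TAB, SPACE
        || c == 0x200E || c == 0x200F;   // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
}
```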
That list poses some challenges for C++ and implementations.
These additional whitespaces are not in the Basic Latin block, so supporting them would require implementations to expect arbitrary Unicode in places where they might not currently.
I am not sure that the cost/benefit ratio justifies adding these characters.
Furthermore, I think it would be ill-advised to consider LRM and RLM in C++, as these change
the directionality of text. That is sensible in multilingual prose, but it poses interesting challenges in C++, challenges which have already been discussed in the context of UAX31.
NEL is of course used by EBCDIC, but it could be mapped to LF in phase 1, as is recommended by
UTF-EBCDIC.
As such, I do not think extending the set of new lines and whitespaces has much value.
New lines
There is, however, a catch there. There always is.
The mapping of a new-line character to any other new-line character is not observable, except for
the purpose of raw string literals, which is the subject of CWG-1655.
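To make the observable case concrete, here is a small example of a raw string literal that spans a line break. It is only an illustration; the variable name is made up.

```cpp
// The line break between 'a' and 'b' is part of the literal.
// Assuming this file is stored with LF line endings, the array holds
// 'a', '\n', 'b' and the terminating null character.
constexpr char raw_two_lines[] = R"(a
b)";
```

Whether the same source stored with a different line-ending convention should produce the same array is exactly where the mapping becomes visible to the program.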
I believe that, from the user's perspective, it is reasonable for raw strings to use
the line terminator appropriate for the target platform.
It is also in line with the ideas that non-visible characters should not affect the semantics of programs and that source code should be portable.
I believe the following mechanism would provide the desirable observable behavior:
1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line sequence (CR, LF, NEL, CR+LF) by LF (in the same way that whitespace and comments are replaced by SPACE in phase 3)
2/ Define new-line to be an implementation-defined sequence of abstract characters representable in the literal and wide literal encodings (for the benefit of escape sequences, raw strings and chrono)
3/ In phase 5, before converting to the execution encoding, replace each LF by a new-line in raw string literals
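A rough sketch of steps 1/ and 3/, assuming the source has already been transcoded to UTF-32. The function names and the idea of passing the implementation-defined new-line as a parameter are mine, purely for illustration; a real implementation would do this inside its lexer.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Step 1 (sketch): fold each new-line sequence named above
// (CR+LF, CR, NEL, LF) into a single LF.
std::u32string normalize_new_lines(std::u32string_view source) {
    constexpr char32_t LF = 0x000A, CR = 0x000D, NEL = 0x0085;
    std::u32string out;
    out.reserve(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        const char32_t c = source[i];
        if (c == CR) {
            // CR+LF counts as one new line: skip the LF that follows a CR.
            if (i + 1 < source.size() && source[i + 1] == LF) ++i;
            out.push_back(LF);
        } else if (c == NEL) {
            out.push_back(LF);
        } else {
            out.push_back(c);  // LF and every other character pass through
        }
    }
    return out;
}

// Step 3 (sketch): inside a raw string literal body, replace each LF by the
// implementation-defined new-line sequence (passed in as an assumption).
std::u32string expand_new_lines(std::u32string_view raw_body,
                                std::u32string_view new_line) {
    std::u32string out;
    for (const char32_t c : raw_body) {
        if (c == 0x000A) {
            out.append(new_line);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With these two steps, a raw string body would come out the same whether the source file used CR+LF, CR, NEL or LF, and its content would depend only on the new-line chosen for the target.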
The good news is that we can improve all of that without going to EWG.
Corentin
--