Subject: On whitespaces and new-line
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-03-25 08:46:29
As indicated in the telecon, here is a mail full of whitespaces and line
The issues with whitespaces and line breaks are multiple.
- We are not consistent about the spelling of whitespace - Editorial PR
- We are not consistent about using "whitespace character" or just
"whitespace". I believe the solution here would be to make whitespace a
- We should use the Unicode name, in upper case, to spell the various
whitespaces when they are mentioned
- new-line is sometimes a grammar term, sometimes not
I believe the solution for all of these issues is to introduce and use
grammar terms for both new-line and whitespaces
*Unicode whitespaces and newlines*
The list of new lines is as follows
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
*NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029*
The list of additional whitespaces is as follows
U+0009 HORIZONTAL TAB
*U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK*
The whitespaces not supported by C++ are in bold.
That list poses some challenges for C++ and implementations
These additional whitespaces are not in the Basic Latin block, which would
require implementations to expect arbitrary Unicode in places where they
might not currently.
I am not sure that the cost/benefit ratio justifies adding these characters.
Furthermore, I think it would be ill-advised to consider LRM and RLM in C++,
as these change the directionality of text. That is sensible in multilingual
prose, but it poses interesting challenges in C++, challenges which have
already been discussed in the context of UAX31.
NEL is of course used by EBCDIC but could be mapped in phase 1 to LF as is
As such, I do not think extending the set of new lines and whitespaces has
There is, however, a catch there. There always is.
The mapping of a new line character to any other new line character is not
observable, except for
the purpose of raw-string literals.
This is the subject of CWG-1655.
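
To make the raw-string exception concrete, here is a minimal sketch. It assumes a conforming C++17 compiler and a source file containing a physical line break inside the literal; the name raw_text is only illustrative. Whatever new-line sequence the file uses on disk, the literal is expected to contain a single '\n', which is the one place the mapping becomes visible to the program.

```cpp
#include <string_view>

// A raw string literal whose body spans a physical line break in the source.
constexpr std::string_view raw_text = R"(a
b)";

// The source new-line is observable here as exactly one '\n' character.
static_assert(raw_text.size() == 3, "one character per letter plus one new-line");
static_assert(raw_text[1] == '\n', "the line break shows up as '\\n'");
```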
I believe that, from the user's perspective, it is reasonable that raw-strings
the line terminator appropriate for the target platform.
It's also in line with the ideas that non-visible characters should not
impact the semantics of programs and that source code should be portable.
I believe the following mechanism would provide the desirable observable
1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line
sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces and
comments are replaced by SPACE in phase 4)
2/ define new-line to be an implementation-defined sequence of abstract
characters representable in the literal and wide literal encodings (for the
benefit of escape-sequences, raw strings and chrono)
3/ In phase 5, before converting to the execution encoding, replace each LF
by a new-line in raw string literals
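
The three steps above can be sketched roughly as follows. This is only an illustration, not proposed wording or any compiler's actual implementation: the function names normalize_new_lines and expand_raw_string_new_lines are hypothetical, the input is assumed to be already transcoded to UTF-32, and the implementation-defined new-line of step 2/ is modelled as a plain parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Code points taken from the list of new lines in the mail.
constexpr char32_t kLF  = 0x000A;
constexpr char32_t kCR  = 0x000D;
constexpr char32_t kNEL = 0x0085;

// Step 1/: after transcoding to Unicode, replace each new-line sequence
// (CR, LF, NEL, CR LF) with a single LF.
std::u32string normalize_new_lines(std::u32string_view source) {
    std::u32string out;
    out.reserve(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        const char32_t c = source[i];
        if (c == kCR) {
            // Treat CR LF as one sequence so it yields one LF, not two.
            if (i + 1 < source.size() && source[i + 1] == kLF) ++i;
            out.push_back(kLF);
        } else if (c == kNEL) {
            out.push_back(kLF);
        } else {
            out.push_back(c);  // LF and every other code point pass through
        }
    }
    return out;
}

// Step 3/: inside a raw string literal body, replace each LF with the
// implementation-defined new-line sequence of step 2/ (a parameter here).
std::u32string expand_raw_string_new_lines(std::u32string_view body,
                                           std::u32string_view new_line) {
    std::u32string out;
    for (const char32_t c : body) {
        if (c == kLF) out.append(new_line);
        else out.push_back(c);
    }
    return out;
}

// Small self-check showing the intended round trip for a CR LF target.
inline void new_line_sketch_self_check() {
    const std::u32string normalized = normalize_new_lines(U"a\r\nb");
    assert(normalized == U"a\nb");
    assert(expand_raw_string_new_lines(normalized, U"\r\n") == U"a\r\nb");
}
```

With this shape, everything between step 1/ and step 3/ only ever sees LF, and only raw string literal bodies are rewritten to the target platform's line terminator.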
The good news is that we can improve all of that without going to EWG
SG16 list run by email@example.com