sg16: [SG16] On whitespaces and new-line

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 25 Mar 2021 14:46:29 +0100

Hello,

As indicated in the telecon, here is a mail full of whitespaces and line
breaks.

The issues with whitespaces and line breaks are multiple.

*Wording:*

- We are not consistent about the spelling of whitespace - Editorial PR
https://github.com/cplusplus/draft/pull/4557
- We are not consistent about using "whitespace character" or just
"whitespace". I believe the solution here would be to make whitespace a
grammar term
- We should use the Unicode name, in upper case, to spell the various
whitespaces when they are mentioned
- new-line is sometimes a grammar term, sometimes not

I believe the solution for all of these issues is to introduce and use
grammar terms for both new-line and whitespaces

*Unicode whitespaces and newlines*

The list of new lines is as follows

LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)

*NEL: Next Line, U+0085*
*LS: Line Separator, U+2028*
*PS: Paragraph Separator, U+2029*

The list of additional whitespaces is as follows

U+0009 HORIZONTAL TAB
U+0020 SPACE

*U+200E LEFT-TO-RIGHT MARK*
*U+200F RIGHT-TO-LEFT MARK*

The whitespaces not supported by C++ are in bold.
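
To make the two lists concrete, here is a rough sketch of how they could be
encoded as a lookup. The names CharClass, classify and outside_cxx_today are
made up for this mail; they are not taken from the standard or from any
implementation.

```cpp
// Hypothetical classification of the code points listed in this mail.
enum class CharClass { NewLine, Whitespace, Other };

// Classify a Unicode code point according to the two lists above.
// CR+LF is a two-code-point sequence, so it is not handled here.
inline CharClass classify(char32_t c) {
    switch (c) {
    case 0x000A: // LF
    case 0x000B: // VT
    case 0x000C: // FF
    case 0x000D: // CR
    case 0x0085: // NEL
    case 0x2028: // LS
    case 0x2029: // PS
        return CharClass::NewLine;
    case 0x0009: // HORIZONTAL TAB
    case 0x0020: // SPACE
    case 0x200E: // LEFT-TO-RIGHT MARK
    case 0x200F: // RIGHT-TO-LEFT MARK
        return CharClass::Whitespace;
    default:
        return CharClass::Other;
    }
}

// True for the entries shown in bold above, i.e. the ones this mail
// lists as not supported by C++.
inline bool outside_cxx_today(char32_t c) {
    return c == 0x0085 || c == 0x2028 || c == 0x2029 ||
           c == 0x200E || c == 0x200F;
}
```

All five bold entries are above U+007F, so none of them can be recognized
without looking past the basic characters.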
That list poses some challenges for C++ and implementations.

These additional whitespaces are not in the Basic Latin block, which would
require implementations to expect arbitrary Unicode in places where they
might not currently.
I am not sure that the cost/benefit ratio justifies adding these characters.

Furthermore, I think it would be ill-advised to consider LRM and RLM in C++,
as these change the directionality of text. As sensible as that is in
multilingual prose, it poses interesting challenges in C++, challenges which
have already been discussed in the context of UAX31.

NEL is of course used by EBCDIC but could be mapped in phase 1 to LF, as is
recommended by UTF-EBCDIC.

As such, I do not think extending the set of new lines and whitespaces has
much value.

*New lines*

There is, however, a catch there. There always is.
The mapping of a new line character to any other new line character is not
observable, except for the purpose of raw-string literals, which is the
subject of CWG-1655:
http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655

I believe that, from the user's perspective, it is reasonable that raw-strings
use the line terminator appropriate for the target platform.
It's also in line with the ideas that non-visible characters should not
impact the semantics of programs and that source code should be portable.

I believe the following mechanism would provide the desirable observable
behavior:

1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line
sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces and
comments are replaced by SPACE in phase 3)

2/ Define new-line to be an implementation-defined sequence of abstract
characters representable in the literal and wide literal encodings (for the
benefit of escape-sequences, raw strings and chrono)

3/ In phase 5, before converting to the execution encoding, replace each LF
by a new-line in raw string literals
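
The three steps above can be sketched roughly as follows. This is only an
illustration under my own assumptions: normalize_new_lines and
expand_raw_string_new_lines are hypothetical names, the source is assumed to
be already transcoded to UTF-32, and the new_line argument stands in for the
implementation-defined sequence of step 2.

```cpp
#include <cstddef>
#include <string>

// Step 1 (phase 1 or 2): after transcoding to Unicode, collapse every
// new-line sequence (CR, LF, NEL, CRLF) to a single LF.
inline std::u32string normalize_new_lines(const std::u32string& src) {
    const char32_t LF = 0x000A, CR = 0x000D, NEL = 0x0085;
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == CR) {
            // CR immediately followed by LF counts as one new-line.
            if (i + 1 < src.size() && src[i + 1] == LF)
                ++i;
            out.push_back(LF);
        } else if (c == NEL) {
            out.push_back(LF);
        } else {
            out.push_back(c); // LF and every other character are kept
        }
    }
    return out;
}

// Steps 2 and 3 (phase 5): inside a raw string literal, replace each LF
// by the implementation-defined new-line sequence before conversion to
// the execution encoding. `body` is the already-normalized literal text.
inline std::u32string expand_raw_string_new_lines(const std::u32string& body,
                                                  const std::u32string& new_line) {
    const char32_t LF = 0x000A;
    std::u32string out;
    for (char32_t c : body) {
        if (c == LF)
            out += new_line;
        else
            out.push_back(c);
    }
    return out;
}
```

Passing U"\r\n" as new_line would model a target whose line terminator is
CRLF; that choice is just an example, not a claim about any particular
implementation.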

The good news is that we can improve all of that without going to EWG

*Corentin*

Received on 2021-03-25 08:46:43