sg16: Re: [SG16] On whitespaces and new-line

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 25 Mar 2021 16:34:13 -0400

On 3/25/21 3:46 PM, Steve Downey via SG16 wrote:
> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
> White_Space
> <https://www.unicode.org/reports/tr44/tr44-26.html#White_Space> B N
> Spaces, separator characters and other control characters which should
> be treated by programming languages as "white space" for the purpose
> of parsing elements. See also Line_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Line_Break>,
> Grapheme_Cluster_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Grapheme_Cluster_Break>,
> Sentence_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Sentence_Break>,
> and Word_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Word_Break>, which
> classify space characters and related controls somewhat differently
> for particular text segmentation contexts.
>
>
> And from PropList.txt, where the White_Space binary property lives
> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>
> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
> 0020 ; White_Space # Zs SPACE
> 0085 ; White_Space # Cc <control-0085>
> 00A0 ; White_Space # Zs NO-BREAK SPACE
> 1680 ; White_Space # Zs OGHAM SPACE MARK
> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
> 2028 ; White_Space # Zl LINE SEPARATOR
> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
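
For illustration, the PropList.txt ranges quoted above can be turned into a small predicate. This is only a sketch: the name is_white_space is hypothetical, and the table is copied from the Unicode 13.0.0 lines in this mail, not taken from anything in the C++ draft.

```cpp
// Hypothetical helper: true if c has the White_Space binary property,
// per the PropList.txt (Unicode 13.0.0) entries quoted in this mail.
constexpr bool is_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)     // <control-0009>..<control-000D>
        || c == 0x0020                      // SPACE
        || c == 0x0085                      // <control-0085> (NEL)
        || c == 0x00A0                      // NO-BREAK SPACE
        || c == 0x1680                      // OGHAM SPACE MARK
        || (c >= 0x2000 && c <= 0x200A)     // EN QUAD..HAIR SPACE
        || c == 0x2028                      // LINE SEPARATOR
        || c == 0x2029                      // PARAGRAPH SEPARATOR
        || c == 0x202F                      // NARROW NO-BREAK SPACE
        || c == 0x205F                      // MEDIUM MATHEMATICAL SPACE
        || c == 0x3000;                     // IDEOGRAPHIC SPACE
}
```

With this table, U+0085 and U+2028 are White_Space, while U+200E (LEFT-TO-RIGHT MARK) is not, since it does not appear in the quoted list.
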
>
> New-line is a bit more complicated because in some contexts it's a
> line break in source, however that is designated, and other times it
> is exactly the control character '\n', whatever the value of that is.
>
> Raw string literals make this visible, and there's a note that says
> that line breaks in source are to be encoded as \n in the execution
> string.
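
A small example of what that note describes, assuming a conforming implementation; the variable name raw is only illustrative. Per the note, the line break inside the raw string literal is expected to appear as a single '\n' in the execution string.

```cpp
// Illustrative only: a raw string literal spanning two source lines.
const char raw[] = R"(a
b)";

// 'a', new-line, 'b', plus the terminating null character.
static_assert(sizeof(raw) == 4, "the source line break becomes one character");
```
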

There are two CWG issues related to this:

  * CWG #1655: Line endings in raw string literals
    <https://wg21.link/cwg1655>
  * CWG #1709: Stringizing raw string literals containing newline
    <https://wg21.link/cwg1709>

Tom.

>
> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> Hello,
>
> As indicated in the telecon, here is a mail full of whitespaces
> and line breaks.
>
> The issues with whitespaces and line breaks are multiple.
>
> *Wording:*
>
> - We are not consistent about the spelling of whitespace -
> Editorial PR https://github.com/cplusplus/draft/pull/4557
> - We are not consistent about using "whitespace character" or just
> "whitespace". I believe the solution here would be to make
> whitespace a grammar term
> - We should use the Unicode name, in upper case, to spell the
> various whitespaces when they are mentioned
> - new-line is sometimes a grammar term, sometimes not
>
> I believe the solution for all of these issues is to introduce and
> use grammar terms for both new-line and whitespaces
>
> *Unicode whitespaces and newlines*
>
> The list of new lines is as follows
>
> LF: Line Feed, U+000A
> VT: Vertical Tab, U+000B
> FF: Form Feed, U+000C
> CR: Carriage Return, U+000D
> CR+LF: CR (U+000D) followed by LF (U+000A)
> *NEL: Next Line, U+0085
> LS: Line Separator, U+2028
> PS: Paragraph Separator, U+2029*
>
> The list of additional whitespaces is as follows
>
> U+0009 HORIZONTAL TAB
> U+0020 SPACE
> *U+200E LEFT-TO-RIGHT MARK
> U+200F RIGHT-TO-LEFT MARK*
>
> The whitespaces not supported by C++ are in bold.
> That list poses some challenges for C++ and its implementations.
>
> These additional whitespaces are not in the Basic Latin block,
> which would require implementations to expect arbitrary Unicode in
> places where they might not currently.
> I am not sure that the cost/benefit ratio justifies adding these
> characters.
>
> Furthermore, I think it would be ill-advised to consider LRM and
> RLM in C++, as these change the directionality of text. That is
> sensible in multilingual prose, but it poses interesting challenges
> in C++, challenges which have already been discussed in the context
> of UAX31.
>
> NEL is of course used by EBCDIC, but could be mapped in phase 1 to
> LF, as is recommended by
> UTF-EBCDIC.
>
> As such, I do not think extending the set of new lines and
> whitespaces has much value.
>
> *New lines*
>
> There is, however, a catch there. There always is.
> The mapping of a new-line character to any other new-line
> character is not observable, except in raw string literals,
> which is the subject of CWG-1655
> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>
> I believe that, from the user's perspective, it is reasonable that
> raw strings use
> the line terminator appropriate for the target platform.
> It's also in line with the ideas that non-visible characters
> should not impact the semantics of programs and that source code
> should be portable.
>
> I believe the following mechanism would provide the desirable
> observable behavior:
>
> 1/ In phase 1 or 2, after transcoding to Unicode, replace any
> new-line sequence (CR, LF, NEL, CRLF) by LF (in the same way
> comments and whitespaces are replaced by SPACE in phase 3)
>
> 2/ Define new-line to be an implementation-defined sequence of
> abstract characters representable in the literal and wide literal
> encodings (for the benefit of escape sequences, raw strings and
> chrono)
>
> 3/ In phase 5, before converting to the execution encoding,
> replace each LF by a new-line in raw string literals
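
The three steps above could be sketched as follows. This is a rough model, not proposed wording: the function names normalize_new_lines and expand_raw_new_lines are hypothetical, source text is modeled as a std::u32string after transcoding to Unicode, and the target new-line sequence is passed in as a parameter to stand in for the implementation-defined choice of step 2.

```cpp
#include <cstddef>
#include <string>

// Step 1 (sketch): collapse CR LF, CR, LF and NEL into a single LF.
std::u32string normalize_new_lines(const std::u32string& src) {
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == U'\r') {
            // Treat CR immediately followed by LF as one new-line sequence.
            if (i + 1 < src.size() && src[i + 1] == U'\n') {
                ++i;
            }
            out.push_back(U'\n');
        } else if (c == char32_t(0x0085)) {
            // NEL (U+0085) is mapped to LF as well.
            out.push_back(U'\n');
        } else {
            // LF and every other character pass through unchanged.
            out.push_back(c);
        }
    }
    return out;
}

// Step 3 (sketch): inside a raw string literal, replace each LF by the
// target new-line sequence. `new_line` models the implementation-defined
// sequence from step 2 and is an assumption of this example.
std::u32string expand_raw_new_lines(const std::u32string& raw_body,
                                    const std::u32string& new_line) {
    std::u32string out;
    for (char32_t c : raw_body) {
        if (c == U'\n') {
            out.append(new_line);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For a target whose new-line is CR LF, expand_raw_new_lines(U"a\nb", U"\r\n") would give a, CR, LF, b, while the rest of the translation only ever sees LF after step 1.
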
>
> The good news is that we can improve all of that without going to EWG
>
> *Corentin*
>
>
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2021-03-25 15:34:16