SG16

Subject: Re: On whitespaces and new-line
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-03-25 16:03:10


On Thu, Mar 25, 2021 at 9:40 PM Steve Downey <sdowney_at_[hidden]> wrote:

> It's my understanding that Pattern_White_Space is for pattern languages,
> like regex. From TR31: Examples include regular expressions, Java
> collation rules, Excel or ICU number formats, and many others. In the past,
> regular expressions and other formal languages have been forced to use
> clumsy combinations of ASCII characters for their syntax.
> https://www.unicode.org/reports/tr31/#Pattern_Syntax
>

Are you aware of any precedent for these whitespaces in other programming
languages?
For example, Rust uses Pattern_White_Space
https://doc.rust-lang.org/reference/whitespace.html
Answering my own question, JS (https://tc39.es/ecma262/#prod-WhiteSpace)
seems to support Space_Separator
https://www.compart.com/en/unicode/category/Zs - but it does not
support NEL

But... I think we should have some motivation there
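
For reference, the Pattern_White_Space set is small enough to write out by
hand. The following is only a sketch of what a lexer-side check could look
like; the function name is_pattern_white_space is hypothetical and not taken
from any implementation.

```cpp
// Pattern_White_Space as listed in PropList.txt: 11 code points in total.
// The name is_pattern_white_space is an illustrative assumption.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEXT LINE (NEL)
        || c == 0x200E || c == 0x200F    // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
        || c == 0x2028 || c == 0x2029;   // LINE SEPARATOR, PARAGRAPH SEPARATOR
}

// NO-BREAK SPACE is in Space_Separator (Zs) but not in Pattern_White_Space.
static_assert(is_pattern_white_space(0x0085), "NEL is pattern whitespace");
static_assert(!is_pattern_white_space(0x00A0), "NBSP is not pattern whitespace");
```

This makes the difference from the JS choice visible: Space_Separator would
accept U+00A0 and reject U+0085, the opposite of the check above.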

>
> On Thu, Mar 25, 2021 at 4:05 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Thu, Mar 25, 2021 at 8:46 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>
>>> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
>>> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
>>> White_Space
>>> <https://www.unicode.org/reports/tr44/tr44-26.html#White_Space> B N Spaces,
>>> separator characters and other control characters which should be treated
>>> by programming languages as "white space" for the purpose of parsing
>>> elements. See also Line_Break
>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Line_Break>,
>>> Grapheme_Cluster_Break
>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Grapheme_Cluster_Break>
>>> , Sentence_Break
>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Sentence_Break>, and
>>> Word_Break
>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Word_Break>, which
>>> classify space characters and related controls somewhat differently for
>>> particular text segmentation contexts.
>>>
>>> And from PropList.txt, where the White_Space binary property lives
>>> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>>>
>>>
>>>
>>> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
>>> 0020 ; White_Space # Zs SPACE
>>> 0085 ; White_Space # Cc <control-0085>
>>> 00A0 ; White_Space # Zs NO-BREAK SPACE
>>> 1680 ; White_Space # Zs OGHAM SPACE MARK
>>> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
>>> 2028 ; White_Space # Zl LINE SEPARATOR
>>> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
>>> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
>>> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
>>> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
>>>
>>>
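
The White_Space entries quoted above can be turned into a range table. This
is a minimal sketch that assumes the Unicode 13.0 data exactly as listed; the
names CodePointRange, white_space_ranges and is_white_space are made up for
illustration.

```cpp
// One inclusive range per White_Space line of PropList.txt (Unicode 13.0),
// with the adjacent entries 2028 and 2029 merged into a single range.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

constexpr CodePointRange white_space_ranges[] = {
    {0x0009, 0x000D},  // <control-0009>..<control-000D>
    {0x0020, 0x0020},  // SPACE
    {0x0085, 0x0085},  // <control-0085> (NEL)
    {0x00A0, 0x00A0},  // NO-BREAK SPACE
    {0x1680, 0x1680},  // OGHAM SPACE MARK
    {0x2000, 0x200A},  // EN QUAD..HAIR SPACE
    {0x2028, 0x2029},  // LINE SEPARATOR, PARAGRAPH SEPARATOR
    {0x202F, 0x202F},  // NARROW NO-BREAK SPACE
    {0x205F, 0x205F},  // MEDIUM MATHEMATICAL SPACE
    {0x3000, 0x3000},  // IDEOGRAPHIC SPACE
};

// Linear scan; the table is short enough that nothing cleverer is needed.
inline bool is_white_space(char32_t c) {
    for (const CodePointRange& r : white_space_ranges) {
        if (c >= r.first && c <= r.last) return true;
    }
    return false;
}
```

Note that U+200E and U+200F are not in this table: they are
Pattern_White_Space but not White_Space.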
>> I should have clarified that the list I am using is Pattern_White_Space
>> https://unicode.org/reports/tr31/#R3
>>
>> List is in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>>
>>
>>>
>>> New-line is a bit more complicated because in some contexts it's a line
>>> break in source, however that is designated, and other times it is exactly
>>> the control character '\n', whatever the value of that is.
>>>
>>> Raw string literals make this visible, and there's a note that says that
>>> line breaks in source are to be encoded as \n in the execution string.
>>>
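
A small sketch of how a raw string literal makes the source line break
observable. It assumes the file is stored with LF line endings and that the
implementation follows the note mentioned above; the names two_lines and
line_break_is_single_lf are hypothetical.

```cpp
#include <cstring>

// The literal spans two source lines, so the line break between "first"
// and "second" is part of its value.
const char* const two_lines = R"(first
second)";

// True when the source line break ended up as exactly one '\n':
// 5 characters + 1 line break + 6 characters = 12.
inline bool line_break_is_single_lf() {
    return std::strlen(two_lines) == 12 && two_lines[5] == '\n';
}
```

With CR LF source files the answer depends on how the implementation handles
line endings in raw strings, which is the part that is not nailed down.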
>>> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16 <sg16_at_[hidden]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> As indicated in the telecon, here is a mail full of whitespaces and
>>>> line breaks.
>>>>
>>>> The issues with whitespaces and line breaks are multiple.
>>>>
>>>> *Wording:*
>>>>
>>>> - We are not consistent about the spelling of whitespace - Editorial PR
>>>> https://github.com/cplusplus/draft/pull/4557
>>>> - We are not consistent about using "whitespace character" or just
>>>> "whitespace". I believe the solution here would be to make whitespace a
>>>> grammar term
>>>> - We should use the Unicode name, in upper case, to spell the various
>>>> whitespaces when they are mentioned
>>>> - new-line is sometimes a grammar term, sometimes not
>>>>
>>>> I believe the solution for all of these issues is to introduce and use
>>>> grammar terms for both new-line and whitespaces
>>>>
>>>> *Unicode whitespaces and newlines*
>>>>
>>>> The list of new lines is as follows
>>>>
>>>> LF: Line Feed, U+000A
>>>> VT: Vertical Tab, U+000B
>>>> FF: Form Feed, U+000C
>>>> CR: Carriage Return, U+000D
>>>> CR+LF: CR (U+000D) followed by LF (U+000A)
>>>>
>>>>
>>>> *NEL: Next Line, U+0085*
>>>> *LS: Line Separator, U+2028*
>>>> *PS: Paragraph Separator, U+2029*
>>>>
>>>> The list of additional whitespaces is as follows
>>>>
>>>> U+0009 HORIZONTAL TAB
>>>> U+0020 SPACE
>>>>
>>>> *U+200E LEFT-TO-RIGHT MARK*
>>>> *U+200F RIGHT-TO-LEFT MARK*
>>>>
>>>> The whitespaces not supported by C++ are in bold.
>>>> That list poses some challenges for C++ and implementations
>>>>
>>>> These additional whitespaces are not in the Basic Latin block, which
>>>> would require implementations to expect arbitrary Unicode in places where
>>>> they might not currently.
>>>> I am not sure that the cost/benefit ratio justifies adding these
>>>> characters.
>>>>
>>>> Furthermore, I think it would be ill-advised to consider LRM and RLM in
>>>> C++, as these change the directionality of text. That is sensible in
>>>> multilingual prose, but it poses interesting challenges in C++,
>>>> challenges which have already been discussed in the context of UAX31.
>>>>
>>>> NEL is of course used by EBCDIC, but it could be mapped to LF in
>>>> phase 1, as is recommended by UTF-EBCDIC.
>>>>
>>>> As such, I do not think extending the set of new lines and whitespaces
>>>> has much value.
>>>>
>>>> *New lines*
>>>>
>>>> There is, however, a catch there. There always is.
>>>> The mapping of a new line character to any other new line character is
>>>> not observable, except for
>>>> the purpose of raw-string literals.
>>>>
>>>> Which is the subject of CWG-1655
>>>> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>>>>
>>>> I believe that, from the user's perspective, it is reasonable that
>>>> raw strings use the line terminator appropriate for the target platform.
>>>> It's also in line with the ideas that non-visible characters should not
>>>> impact the semantics of programs and that source code should be portable.
>>>>
>>>> I believe the following mechanism would provide the desirable
>>>> observable behavior:
>>>>
>>>> 1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line
>>>> sequence (CR, LF, NEL, CRLF) by LF (in the same way that comments are
>>>> replaced by SPACE in phase 3)
>>>>
>>>> 2/ Define new-line to be an implementation-defined sequence of abstract
>>>> characters representable in the literal and wide literal encodings (for
>>>> the benefit of escape-sequences, raw strings and chrono)
>>>>
>>>> 3/ In phase 5, before converting to the execution encoding, replace
>>>> each LF by a new-line in raw string literals
>>>>
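
The steps above can be sketched as two small functions over code points.
This is only an illustration of the proposed observable behavior, not how any
compiler does it; normalize_new_lines, expand_new_lines and the choice of
std::u32string are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Step 1 (phase 1 or 2): replace each new-line sequence
// (CR LF, CR, LF, NEL) by a single LF.
inline std::u32string normalize_new_lines(const std::u32string& in) {
    constexpr char32_t LF = 0x000A, CR = 0x000D, NEL = 0x0085;
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == CR) {
            // CR LF counts as one new-line sequence, so skip the LF.
            if (i + 1 < in.size() && in[i + 1] == LF) ++i;
            out.push_back(LF);
        } else if (c == NEL) {
            out.push_back(LF);
        } else {
            out.push_back(c);  // LF and every other character pass through
        }
    }
    return out;
}

// Step 3 (phase 5): inside a raw string literal, replace each LF by the
// implementation-defined new-line sequence from step 2, passed in here
// as new_line.
inline std::u32string expand_new_lines(const std::u32string& raw,
                                       const std::u32string& new_line) {
    std::u32string out;
    for (const char32_t c : raw) {
        if (c == 0x000A) {
            out += new_line;
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For example, a target whose new-line is CR LF would call
expand_new_lines(body, U"\r\n") on the normalized raw string body, while a
target using LF would leave it unchanged.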
>>>> The good news is that we can improve all of that without going to EWG
>>>>
>>>> *Corentin*
>>>>
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>



SG16 list run by sg16-owner@lists.isocpp.org