Re: [SG16] On whitespaces and new-line

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 25 Mar 2021 17:16:24 -0400
I was looking at White_Space because of the definition: "Spaces, separator
characters and other control characters which should be treated by
programming languages as "white space" for the purpose of parsing elements."
Programming language text has more in common with general text than it does
with regex. However, I'm not sure the marginal value of adding additional
white space characters would really justify the cost. Other than stunts
like writing a "hello, world" entirely in Ogham, it's probably not going to
get much use. Making sure that NEL is in the list, so that if EBCDIC is
transcoded it remains valid, and doesn't need some variety of dos2unix run
on it, probably covers all the real world use cases?
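
To make the difference between the two properties concrete, here is a minimal sketch of both as predicates. The White_Space ranges are copied from the PropList.txt excerpt quoted below, and the Pattern_White_Space set is the one listed in the original mail (new lines plus tab, space, and the two directional marks). The function names are hypothetical and only for illustration; this is not taken from any compiler.

```cpp
// White_Space, per the PropList.txt (Unicode 13.0.0) entries quoted in this thread.
constexpr bool is_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F ||
           c == 0x3000;
}

// Pattern_White_Space: U+0009..U+000D, U+0020, U+0085, U+200E, U+200F,
// U+2028, U+2029.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

// NEL is in both sets; OGHAM SPACE MARK is only White_Space;
// LEFT-TO-RIGHT MARK is only Pattern_White_Space.
static_assert(is_white_space(0x0085) && is_pattern_white_space(0x0085));
static_assert(is_white_space(0x1680) && !is_pattern_white_space(0x1680));
static_assert(!is_white_space(0x200E) && is_pattern_white_space(0x200E));
```

Either way NEL is covered, which is the case I actually care about.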

RTL modifiers in code certainly don't seem like wonderful things. Has
anyone checked what clang or gcc does now?

On Thu, Mar 25, 2021 at 5:03 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Thu, Mar 25, 2021 at 9:40 PM Steve Downey <sdowney_at_[hidden]> wrote:
>
>> It's my understanding that Pattern_White_Space is for pattern languages,
>> like regex. From TR31: Examples include regular expressions, Java
>> collation rules, Excel or ICU number formats, and many others. In the past,
>> regular expressions and other formal languages have been forced to use
>> clumsy combinations of ASCII characters for their syntax.
>> https://www.unicode.org/reports/tr31/#Pattern_Syntax
>>
>
> Are you aware of any precedent for these whitespaces in other programming
> languages?
> For example, Rust uses Pattern_White_Space
> https://doc.rust-lang.org/reference/whitespace.html
> Answering my own question, JS (https://tc39.es/ecma262/#prod-WhiteSpace)
> seems to support Space_Separator
> https://www.compart.com/en/unicode/category/Zs - but they do not
> support NEL
>
> But... I think we should have some motivation there
>
>
>
>>
>> On Thu, Mar 25, 2021 at 4:05 PM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Thu, Mar 25, 2021 at 8:46 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>>
>>>> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
>>>> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
>>>> White_Space
>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#White_Space> B N Spaces,
>>>> separator characters and other control characters which should be treated
>>>> by programming languages as "white space" for the purpose of parsing
>>>> elements. See also Line_Break
>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Line_Break>,
>>>> Grapheme_Cluster_Break
>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Grapheme_Cluster_Break>
>>>> , Sentence_Break
>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Sentence_Break>,
>>>> and Word_Break
>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Word_Break>, which
>>>> classify space characters and related controls somewhat differently for
>>>> particular text segmentation contexts.
>>>>
>>>> And from PropList.txt, where the White_Space binary property lives
>>>> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>>>>
>>>>
>>>>
>>>> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
>>>> 0020 ; White_Space # Zs SPACE
>>>> 0085 ; White_Space # Cc <control-0085>
>>>> 00A0 ; White_Space # Zs NO-BREAK SPACE
>>>> 1680 ; White_Space # Zs OGHAM SPACE MARK
>>>> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
>>>> 2028 ; White_Space # Zl LINE SEPARATOR
>>>> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
>>>> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
>>>> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
>>>> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
>>>>
>>>>
>>> I should have clarified that the list I am using is Pattern_White_Space
>>> https://unicode.org/reports/tr31/#R3
>>>
>>> List is in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>>>
>>>
>>>>
>>>> New-line is a bit more complicated because in some contexts it's a line
>>>> break in source, however that is designated, and other times it is exactly
>>>> the control character '\n', whatever the value of that is.
>>>>
>>>> Raw string literals make this visible, and there's a note that says
>>>> that line breaks in source are to be encoded as \n in the execution string.
>>>>
>>>> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> As indicated in the telecon, here is a mail full of whitespaces and
>>>>> line breaks.
>>>>>
>>>>> The issues with whitespaces and line breaks are multiple.
>>>>>
>>>>> *Wording:*
>>>>>
>>>>> - We are not consistent about the spelling of whitespace - Editorial
>>>>> PR https://github.com/cplusplus/draft/pull/4557
>>>>> <https://github.com/cplusplus/draft/pull/4557>
>>>>> - We are not consistent about using "whitespace character" or just
>>>>> "whitespace". I believe the solution here would be to make whitespace a
>>>>> grammar term
>>>>> - We should use the Unicode name, in upper case, to spell the various
>>>>> whitespaces when they are mentioned
>>>>> - new-line is sometimes a grammar term, sometimes not
>>>>>
>>>>> I believe the solution for all of these issues is to introduce and use
>>>>> grammar terms for both new-line and whitespaces
>>>>>
>>>>> *Unicode whitespaces and newlines*
>>>>>
>>>>> The list of new lines is as follows
>>>>>
>>>>> LF: Line Feed, U+000A
>>>>> VT: Vertical Tab, U+000B
>>>>> FF: Form Feed, U+000C
>>>>> CR: Carriage Return, U+000D
>>>>> CR+LF: CR (U+000D) followed by LF (U+000A)
>>>>>
>>>>>
>>>>> *NEL: Next Line, U+0085*
>>>>> *LS: Line Separator, U+2028*
>>>>> *PS: Paragraph Separator, U+2029*
>>>>>
>>>>> The list of additional whitespaces is as follows
>>>>>
>>>>> U+0009 HORIZONTAL TAB
>>>>> U+0020 SPACE
>>>>>
>>>>> *U+200E LEFT-TO-RIGHT MARK*
>>>>> *U+200F RIGHT-TO-LEFT MARK*
>>>>>
>>>>> The whitespaces not supported by C++ are in bold.
>>>>> That list poses some challenges for C++ and implementations
>>>>>
>>>>> These additional whitespaces are not in the Basic Latin block, which
>>>>> would require implementations to expect arbitrary Unicode in places where
>>>>> they might not currently.
>>>>> I am not sure that the cost/benefit ratio justifies adding these
>>>>> characters.
>>>>>
>>>>> Furthermore, I think it would be ill-advised to consider LRM and RLM
>>>>> in C++, as these change
>>>>> the directionality of text. That, as sensible as it is in
>>>>> multilingual prose, poses interesting challenges in C++, challenges which
>>>>> have already been discussed in the context of UAX31.
>>>>>
>>>>> NEL is of course used by EBCDIC but could be mapped in phase 1 to LF
>>>>> as is recommended by
>>>>> UTF-EBCDIC
>>>>>
>>>>> As such, I do not think extending the set of new lines and whitespaces
>>>>> has much value.
>>>>>
>>>>> *New lines*
>>>>>
>>>>> There is, however, a catch there. There always is.
>>>>> The mapping of a new line character to any other new line character is
>>>>> not observable, except for
>>>>> the purpose of raw-string literals.
>>>>>
>>>>> Which is the subject of CWG-1655
>>>>> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>>>>>
>>>>> I believe that, from the user's perspective, it is reasonable that
>>>>> raw-strings use
>>>>> the line terminator appropriate for the target platform.
>>>>> It's also in line with the ideas that non-visible characters should
>>>>> not impact the semantics of programs and that source code should be
>>>>> portable.
>>>>>
>>>>> I believe the following mechanism would provide the desirable
>>>>> observable behavior:
>>>>>
>>>>> 1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line
>>>>> sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces and
>>>>> comments are replaced by SPACE in phase 4)
>>>>>
>>>>> 2/ Define new-line to be an implementation-defined sequence of
>>>>> abstract characters representable in the literal and wide literal encodings,
>>>>> (for the benefit of escape-sequences, raw strings and chrono)
>>>>>
>>>>> 3/ In phase 5, before converting to the execution encoding, replace
>>>>> each LF by a new-line in raw string literals
>>>>>
>>>>> The good news is that we can improve all of that without going to EWG
>>>>>
>>>>> *Corentin*
>>>>>
>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>>>
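
The three-step mechanism from the original mail (normalize new-line sequences to LF early, define new-line as an implementation-defined sequence, then substitute it back into raw string literals before conversion to the execution encoding) can be sketched roughly as follows. This is only an illustration under stated assumptions: the source is taken to be already transcoded to UTF-32, the function names are hypothetical, and the `new_line` argument stands in for whatever sequence an implementation would define.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Step 1: after transcoding to Unicode, map CR LF, CR, LF and NEL to LF.
std::u32string normalize_new_lines(std::u32string_view src) {
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == U'\r') {
            // Treat CR LF as a single new-line sequence.
            if (i + 1 < src.size() && src[i + 1] == U'\n') ++i;
            out.push_back(U'\n');
        } else if (c == char32_t{0x0085}) {  // NEL
            out.push_back(U'\n');
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Steps 2 and 3: before conversion to the execution encoding, replace each LF
// in the contents of a raw string literal with the implementation-defined
// new-line sequence (passed in here as an assumption).
std::u32string expand_new_lines(std::u32string_view raw_contents,
                                std::u32string_view new_line) {
    std::u32string out;
    for (const char32_t c : raw_contents) {
        if (c == U'\n') {
            out.append(new_line);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With `new_line` set to a single LF the second function is the identity, which matches the current note that line breaks in a raw string end up as \n in the execution string.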

Received on 2021-03-25 16:16:51