SG16

Subject: Re: On whitespaces and new-line
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-03-26 06:45:38


On Thu, Mar 25, 2021 at 10:28 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Thu, Mar 25, 2021 at 10:16 PM Steve Downey <sdowney_at_[hidden]> wrote:
>
>> I was looking at White_Space because of the definition: "Spaces,
>> separator characters and other control characters which should be treated
>> by programming languages as "white space" for the purpose of parsing
>> elements."
>>
>
> Interesting! I missed that
> It's funny because I can't find any language doing that :)
> I also find it... interesting. Might be worth asking the Unicode mailing
> list about!
>

Mail to unicode
https://corp.unicode.org/pipermail/unicode/2021-March/009395.html

>
>
>
>> Programming language text has more in common with general text than it
>> does with regex. However, I'm not sure the marginal value of adding
>> additional white space characters would really justify the cost. Other than
>> stunts like writing a "hello, world" entirely in Ogham, it's probably not
>> going to get much use. Making sure that NEL is in the list, so that if
>> EBCDIC is transcoded it remains valid, and doesn't need some variety of
>> dos2unix run on it, probably covers all the real world use cases?
>>
>> RTL modifiers in code certainly don't seem like wonderful things. Has
>> anyone checked what clang or gcc does now?
>>
>
> Clang doesn't support more than what the standard specifies, although they
> do have a table of Unicode whitespaces
> for the purpose of filtering them out in UCNs
>
>
> https://github.com/llvm/llvm-project/blob/62ec4ac90738a5f2d209ed28c822223e58aaaeb7/clang/include/clang/Basic/CharInfo.h#L70
>
> https://github.com/llvm/llvm-project/blob/main/clang/lib/Lex/UnicodeCharSets.h#L401
>
>
>>
>> On Thu, Mar 25, 2021 at 5:03 PM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Thu, Mar 25, 2021 at 9:40 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>>
>>>> It's my understanding that Pattern_White_Space is for pattern
>>>> languages, like regex. From TR31: Examples include regular
>>>> expressions, Java collation rules, Excel or ICU number formats, and many
>>>> others. In the past, regular expressions and other formal languages have
>>>> been forced to use clumsy combinations of ASCII characters for their
>>>> syntax. https://www.unicode.org/reports/tr31/#Pattern_Syntax
>>>>
>>>
>>> Are you aware of any precedent for these whitespaces in other
>>> programming languages?
>>> For example, Rust uses Pattern_White_Space
>>> https://doc.rust-lang.org/reference/whitespace.html
>>> Answering my own question, JS (https://tc39.es/ecma262/#prod-WhiteSpace)
>>> seems to support Space_Separator
>>> https://www.compart.com/en/unicode/category/Zs - but they do not
>>> support NEL
>>>
>>> But... I think we should have some motivation there
>>>
>>>
>>>
>>>>
>>>> On Thu, Mar 25, 2021 at 4:05 PM Corentin <corentin.jabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 25, 2021 at 8:46 PM Steve Downey <sdowney_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
>>>>>> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
>>>>>> White_Space
>>>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#White_Space> B N Spaces,
>>>>>> separator characters and other control characters which should be treated
>>>>>> by programming languages as "white space" for the purpose of parsing
>>>>>> elements. See also Line_Break
>>>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Line_Break>,
>>>>>> Grapheme_Cluster_Break
>>>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Grapheme_Cluster_Break>
>>>>>> , Sentence_Break
>>>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Sentence_Break>,
>>>>>> and Word_Break
>>>>>> <https://www.unicode.org/reports/tr44/tr44-26.html#Word_Break>,
>>>>>> which classify space characters and related controls somewhat differently
>>>>>> for particular text segmentation contexts.
>>>>>>
>>>>>> And from PropList.txt, where the White_Space binary property lives
>>>>>> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>>>>>>
>>>>>>
>>>>>>
>>>>>> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
>>>>>> 0020 ; White_Space # Zs SPACE
>>>>>> 0085 ; White_Space # Cc <control-0085>
>>>>>> 00A0 ; White_Space # Zs NO-BREAK SPACE
>>>>>> 1680 ; White_Space # Zs OGHAM SPACE MARK
>>>>>> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
>>>>>> 2028 ; White_Space # Zl LINE SEPARATOR
>>>>>> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
>>>>>> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
>>>>>> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
>>>>>> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
>>>>>>
>>>>>>
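For reference, the PropList.txt ranges quoted above translate directly
into a lookup. This is only a sketch: is_white_space is a name made up
for illustration (it is not from the standard or from Clang), and the
ranges are the Unicode 13.0 ones listed in this thread.

```cpp
// Hypothetical helper: true if `c` has the Unicode White_Space property,
// using the Unicode 13.0 PropList.txt ranges quoted in this thread.
constexpr bool is_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // <control-0009>..<control-000D>
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // <control-0085> (NEL)
        || c == 0x00A0                    // NO-BREAK SPACE
        || c == 0x1680                    // OGHAM SPACE MARK
        || (c >= 0x2000 && c <= 0x200A)   // EN QUAD..HAIR SPACE
        || c == 0x2028                    // LINE SEPARATOR
        || c == 0x2029                    // PARAGRAPH SEPARATOR
        || c == 0x202F                    // NARROW NO-BREAK SPACE
        || c == 0x205F                    // MEDIUM MATHEMATICAL SPACE
        || c == 0x3000;                   // IDEOGRAPHIC SPACE
}

static_assert(is_white_space(0x0020), "SPACE is White_Space");
static_assert(is_white_space(0x1680), "OGHAM SPACE MARK is White_Space");
static_assert(!is_white_space(0x200E), "LRM is not in the White_Space list");
```

Note that U+200E and U+200F do not appear in this list; they only show
up in Pattern_White_Space.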
>>>>> I should have clarified that the list I am using is Pattern_White_Space
>>>>> https://unicode.org/reports/tr31/#R3
>>>>>
>>>>> List is in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>>>>>
>>>>>
>>>>>>
>>>>>> New-line is a bit more complicated because in some contexts it's a
>>>>>> line break in source, however that is designated, and other times it is
>>>>>> exactly the control character '\n', whatever the value of that is.
>>>>>>
>>>>>> Raw string literals make this visible, and there's a note that says
>>>>>> that line breaks in source are to be encoded as \n in the execution string.
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> As indicated in the telecon, here is a mail full of whitespaces and
>>>>>>> line breaks.
>>>>>>>
>>>>>>> The issues with whitespaces and line breaks are multiple.
>>>>>>>
>>>>>>> *Wording:*
>>>>>>>
>>>>>>> - We are not consistent about the spelling of whitespace - Editorial
>>>>>>> PR https://github.com/cplusplus/draft/pull/4557
>>>>>>> <https://github.com/cplusplus/draft/pull/4557>
>>>>>>> - We are not consistent about using "whitespace character" or just
>>>>>>> "whitespace". I believe the solution here would be to make whitespace a
>>>>>>> grammar term
>>>>>>> - We should use the Unicode name, in upper case, to spell the various
>>>>>>> whitespaces when they are mentioned
>>>>>>> - new-line is sometimes a grammar term, sometimes not
>>>>>>>
>>>>>>> I believe the solution for all of these issues is to introduce and
>>>>>>> use grammar terms for both new-line and whitespaces
>>>>>>>
>>>>>>> *Unicode whitespaces and newlines*
>>>>>>>
>>>>>>> The list of new lines is as follows
>>>>>>>
>>>>>>> LF: Line Feed, U+000A
>>>>>>> VT: Vertical Tab, U+000B
>>>>>>> FF: Form Feed, U+000C
>>>>>>> CR: Carriage Return, U+000D
>>>>>>> CR+LF: CR (U+000D) followed by LF (U+000A)
>>>>>>>
>>>>>>>
>>>>>>> *NEL: Next Line, U+0085*
>>>>>>> *LS: Line Separator, U+2028*
>>>>>>> *PS: Paragraph Separator, U+2029*
>>>>>>>
>>>>>>> The list of additional whitespaces is as follows
>>>>>>>
>>>>>>> U+0009 HORIZONTAL TAB
>>>>>>> U+0020 SPACE
>>>>>>>
>>>>>>> *U+200E LEFT-TO-RIGHT MARK*
>>>>>>> *U+200F RIGHT-TO-LEFT MARK*
>>>>>>>
>>>>>>> The whitespaces not supported by C++ are in bold.
>>>>>>> That list poses some challenges for C++ and implementations
>>>>>>>
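As an illustration of the two lists above, here is a small classifier.
It is a sketch under assumptions: the names ws_kind, classify and
supported_by_cpp are invented for this example, the split between
"new line" and "other whitespace" follows the lists in this mail, and
"supported by C++" simply mirrors the non-bold entries, which all
happen to be ASCII.

```cpp
// Hypothetical classification of the Pattern_White_Space code points,
// split the same way as the two lists in this mail.
enum class ws_kind { none, new_line, other };

constexpr ws_kind classify(char32_t c) {
    switch (c) {
    case 0x000A:  // LF
    case 0x000B:  // VT
    case 0x000C:  // FF
    case 0x000D:  // CR
    case 0x0085:  // NEL
    case 0x2028:  // LS
    case 0x2029:  // PS
        return ws_kind::new_line;
    case 0x0009:  // HORIZONTAL TAB
    case 0x0020:  // SPACE
    case 0x200E:  // LEFT-TO-RIGHT MARK
    case 0x200F:  // RIGHT-TO-LEFT MARK
        return ws_kind::other;
    default:
        return ws_kind::none;
    }
}

// Assumption: the entries not in bold above are exactly the ASCII ones.
constexpr bool supported_by_cpp(char32_t c) {
    return c < 0x80 && classify(c) != ws_kind::none;
}

static_assert(classify(0x0085) == ws_kind::new_line, "NEL is a new line");
static_assert(supported_by_cpp(0x000A), "LF is supported today");
static_assert(!supported_by_cpp(0x2028), "LS is one of the bold entries");
```

The `c < 0x80` test is the implementation cost in a nutshell: every
unsupported entry needs the lexer to look beyond single-byte ASCII.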
>>>>>>> These additional whitespaces are not in the Basic Latin block, which
>>>>>>> would require implementations to expect arbitrary unicode in places where
>>>>>>> they might not currently.
>>>>>>> I am not sure that the cost/benefit ratio justifies adding these
>>>>>>> characters.
>>>>>>>
>>>>>>> Furthermore, I think it would be ill-advised to consider LRM and RLM
>>>>>>> in C++ as these change
>>>>>>> the directionality of text. Which, as sensible as it is in
>>>>>>> multilingual prose, poses interesting challenges in C++, challenges which
>>>>>>> have already been discussed in the context of UAX31.
>>>>>>>
>>>>>>> NEL is of course used by EBCDIC but could be mapped in phase 1 to
>>>>>>> LF as is recommended by
>>>>>>> UTF-EBCDIC
>>>>>>>
>>>>>>> As such, I do not think extending the set of new lines and
>>>>>>> whitespaces has much value.
>>>>>>>
>>>>>>> *New lines*
>>>>>>>
>>>>>>> There is, however, a catch there. There always is.
>>>>>>> The mapping of a new line character to any other new line character
>>>>>>> is not observable, except for
>>>>>>> the purpose of raw-string literals.
>>>>>>>
>>>>>>> Which is the subject of CWG-1655
>>>>>>> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>>>>>>>
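To make the observable part concrete: a line break inside a raw string
literal ends up in the string as new-line ('\n'), as the note in the
standard mentioned elsewhere in this thread describes. A minimal
example (the variable name text is arbitrary):

```cpp
// A raw string literal whose body contains one source line break.
constexpr char text[] = R"(a
b)";

// 'a', new-line, 'b', plus the terminating null character.
static_assert(sizeof(text) == 4, "the line break is a single character");
static_assert(text[1] == '\n', "the source line break is encoded as new-line");
```

Everywhere else the choice of line terminator in the source file is
invisible to the program; here it determines a character in the
resulting string.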
>>>>>>> I believe that, from the user's perspective, it is reasonable that
>>>>>>> raw-strings use
>>>>>>> the line terminator appropriate for the target platform.
>>>>>>> It's also in line with the ideas that non-visible characters should
>>>>>>> not impact the semantics of programs and that source code should be
>>>>>>> portable.
>>>>>>>
>>>>>>> I believe the following mechanism would provide the desirable
>>>>>>> observable behavior:
>>>>>>>
>>>>>>> 1/ In phase 1 or 2, after transcoding to Unicode, replace any
>>>>>>> new-line sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces
>>>>>>> and comments are replaced by SPACE in phase 3)
>>>>>>>
>>>>>>> 2/ define new-line to be an implementation-defined sequence of
>>>>>>> abstract characters representable in the literal and wide literal encodings,
>>>>>>> (for the benefit of escape-sequences, raw strings and chrono)
>>>>>>>
>>>>>>> 3/ In phase 5, before converting to the execution encoding, replace
>>>>>>> each LF by a new-line in raw string literals
>>>>>>>
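The three steps above can be sketched roughly as follows. This is not
how any compiler implements it: normalize_new_lines and
expand_new_lines are hypothetical names, the input is assumed to be
already transcoded to UTF-32, and the implementation-defined new-line
sequence of step 2/ is simply passed in as a parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Step 1/: after transcoding to Unicode, map CR LF, CR and NEL to LF.
// LF itself is copied through unchanged.
std::u32string normalize_new_lines(std::u32string_view source) {
    std::u32string out;
    out.reserve(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        const char32_t c = source[i];
        if (c == U'\r') {
            // Treat CR LF as a single new-line sequence.
            if (i + 1 < source.size() && source[i + 1] == U'\n') ++i;
            out.push_back(U'\n');
        } else if (c == char32_t{0x0085}) {  // NEL
            out.push_back(U'\n');
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Steps 2/ and 3/: before conversion to the execution encoding, replace
// each LF in the body of a raw string literal by the target's new-line
// sequence (an assumption of this sketch: the caller supplies it).
std::u32string expand_new_lines(std::u32string_view raw_body,
                                std::u32string_view new_line) {
    assert(!new_line.empty());  // a new-line sequence cannot be empty
    std::u32string out;
    for (const char32_t c : raw_body) {
        if (c == U'\n') out.append(new_line);
        else out.push_back(c);
    }
    return out;
}
```

With new_line set to CR LF, expand_new_lines(U"a\nb", U"\r\n") yields
a, CR, LF, b. With new_line set to LF the function is the identity,
which matches what implementations do today.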
>>>>>>> The good news is that we can improve all of that without going to EWG
>>>>>>>
>>>>>>> *Corentin*
>>>>>>>
>>>>>>> --
>>>>>>> SG16 mailing list
>>>>>>> SG16_at_[hidden]
>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>
>>>>>>



SG16 list run by sg16-owner@lists.isocpp.org