sg16: Re: [SG16] On whitespaces and new-line

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 25 Mar 2021 22:28:51 +0100

On Thu, Mar 25, 2021 at 10:16 PM Steve Downey <sdowney_at_[hidden]> wrote:

> I was looking at White_Space because of the definition: "Spaces, separator
> characters and other control characters which should be treated by
> programming languages as "white space" for the purpose of parsing elements."
>

Interesting! I missed that.
It's funny because I can't find any language doing that :)
I also find it... interesting. Might be worth asking the Unicode mailing
list about it!

> Programming language text has more in common with general text than it
> does with regex. However, I'm not sure the marginal value of adding
> additional white space characters would really justify the cost. Other than
> stunts like writing a "hello, world" entirely in Ogham, it's probably not
> going to get much use. Making sure that NEL is in the list, so that if
> EBCDIC is transcoded it remains valid, and doesn't need some variety of
> dos2unix run on it, probably covers all the real world use cases?
>
> RTL modifiers in code certainly don't seem like wonderful things. Has
> anyone checked what clang or gcc does now?
>

Clang doesn't support more than what the standard specifies, although they
do have a table of Unicode whitespaces for the purpose of filtering them
out in UCNs:

https://github.com/llvm/llvm-project/blob/62ec4ac90738a5f2d209ed28c822223e58aaaeb7/clang/include/clang/Basic/CharInfo.h#L70
https://github.com/llvm/llvm-project/blob/main/clang/lib/Lex/UnicodeCharSets.h#L401

>
> On Thu, Mar 25, 2021 at 5:03 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Thu, Mar 25, 2021 at 9:40 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>
>>> It's my understanding that Pattern_White_Space is for pattern languages,
>>> like regex. From TR31: Examples include regular expressions, Java
>>> collation rules, Excel or ICU number formats, and many others. In the past,
>>> regular expressions and other formal languages have been forced to use
>>> clumsy combinations of ASCII characters for their syntax.
>>> https://www.unicode.org/reports/tr31/#Pattern_Syntax
>>>
>>
>> Are you aware of any precedent for these whitespaces in other
>> programming languages?
>> For example, Rust uses Pattern_White_Space
>> https://doc.rust-lang.org/reference/whitespace.html
>> Answering my own question, JS (https://tc39.es/ecma262/#prod-WhiteSpace)
>> seems to support Space_Separator
>> https://www.compart.com/en/unicode/category/Zs - but they do not
>> support NEL
>>
>> But... I think we should have some motivation there
>>
>>
>>
>>>
>>> On Thu, Mar 25, 2021 at 4:05 PM Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Mar 25, 2021 at 8:46 PM Steve Downey <sdowney_at_[hidden]> wrote:
>>>>
>>>>> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
>>>>> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
>>>>>
>>>>> White_Space (B, N): Spaces, separator characters and other control
>>>>> characters which should be treated by programming languages as "white
>>>>> space" for the purpose of parsing elements. See also Line_Break,
>>>>> Grapheme_Cluster_Break, Sentence_Break, and Word_Break, which
>>>>> classify space characters and related controls somewhat differently for
>>>>> particular text segmentation contexts.
>>>>>
>>>>> And from PropList.txt, where the White_Space binary property lives
>>>>> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>>>>>
>>>>>
>>>>>
>>>>> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
>>>>> 0020 ; White_Space # Zs SPACE
>>>>> 0085 ; White_Space # Cc <control-0085>
>>>>> 00A0 ; White_Space # Zs NO-BREAK SPACE
>>>>> 1680 ; White_Space # Zs OGHAM SPACE MARK
>>>>> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
>>>>> 2028 ; White_Space # Zl LINE SEPARATOR
>>>>> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
>>>>> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
>>>>> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
>>>>> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
>>>>>
>>>>>
>>>> I should have clarified that the list I am using is Pattern_White_Space
>>>> https://unicode.org/reports/tr31/#R3
>>>>
>>>> List is in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>>>>
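To make the difference between the two properties concrete, here is a
minimal C++ sketch. The code points are transcribed from the PropList.txt
entries for White_Space (quoted above) and Pattern_White_Space. The function
names are mine and purely illustrative; they are not taken from the standard
or from any implementation.

```cpp
// White_Space, transcribed from the PropList.txt (Unicode 13.0.0) excerpt.
constexpr bool is_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F ||
           c == 0x3000;
}

// Pattern_White_Space, from the same PropList.txt file. It drops the
// non-ASCII Zs spaces and adds the two directional marks.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

// NO-BREAK SPACE is White_Space only.
static_assert(is_white_space(0x00A0) && !is_pattern_white_space(0x00A0),
              "U+00A0 is White_Space but not Pattern_White_Space");
// LEFT-TO-RIGHT MARK is Pattern_White_Space only.
static_assert(!is_white_space(0x200E) && is_pattern_white_space(0x200E),
              "U+200E is Pattern_White_Space but not White_Space");
// NEL is in both sets.
static_assert(is_white_space(0x0085) && is_pattern_white_space(0x0085),
              "U+0085 is in both sets");
```

Both sets agree on the ASCII whitespaces, NEL, LS and PS, so the choice
between them only matters for the Zs spaces and the directional marks.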
>>>>
>>>>>
>>>>> New-line is a bit more complicated because in some contexts it's a
>>>>> line break in source, however that is designated, and other times it is
>>>>> exactly the control character '\n', whatever the value of that is.
>>>>>
>>>>> Raw string literals make this visible, and there's a note that says
>>>>> that line breaks in source are to be encoded as \n in the execution string.
>>>>>
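A small illustration of the raw string point, as a sketch only. The names
`raw` and `escaped` are made up for the example, and it assumes a C++17
compiler. It relies on the note mentioned above: a line break in the source
of a raw string literal results in a new-line in the resulting string.

```cpp
#include <string_view>

// The raw string below spans two source lines. Per the note referred to
// above, the line break shows up as a single '\n' in the literal.
constexpr std::string_view raw = R"(a
b)";

static_assert(raw.size() == 3, "one character per letter plus the new-line");
static_assert(raw[1] == '\n', "the source line break became '\\n'");

// An ordinary literal spells the same new-line with an escape sequence.
constexpr std::string_view escaped = "a\nb";
static_assert(raw == escaped, "both spellings yield the same string");
```

This is the one place where the mapping of source line breaks is observable
from within the program.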
>>>>> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> As indicated in the telecon, here is a mail full of whitespaces and
>>>>>> line breaks.
>>>>>>
>>>>>> The issues with whitespaces and line breaks are multiple.
>>>>>>
>>>>>> *Wording:*
>>>>>>
>>>>>> - We are not consistent about the spelling of whitespace - Editorial
>>>>>> PR https://github.com/cplusplus/draft/pull/4557
>>>>>> - We are not consistent about using "whitespace character" or just
>>>>>> "whitespace". I believe the solution here would be to make whitespace a
>>>>>> grammar term
>>>>>> - We should use the Unicode name, in upper case, to spell the various
>>>>>> whitespaces when they are mentioned
>>>>>> - new-line is sometimes a grammar term, sometimes not
>>>>>>
>>>>>> I believe the solution for all of these issues is to introduce and
>>>>>> use grammar terms for both new-line and whitespaces
>>>>>>
>>>>>> *Unicode whitespaces and newlines*
>>>>>>
>>>>>> The list of new lines is as follows
>>>>>>
>>>>>> LF: Line Feed, U+000A
>>>>>> VT: Vertical Tab, U+000B
>>>>>> FF: Form Feed, U+000C
>>>>>> CR: Carriage Return, U+000D
>>>>>> CR+LF: CR (U+000D) followed by LF (U+000A)
>>>>>>
>>>>>>
>>>>>> *NEL: Next Line, U+0085*
>>>>>> *LS: Line Separator, U+2028*
>>>>>> *PS: Paragraph Separator, U+2029*
>>>>>>
>>>>>> The list of additional whitespaces is as follows
>>>>>>
>>>>>> U+0009 HORIZONTAL TAB
>>>>>> U+0020 SPACE
>>>>>>
>>>>>> *U+200E LEFT-TO-RIGHT MARK*
>>>>>> *U+200F RIGHT-TO-LEFT MARK*
>>>>>>
>>>>>> The whitespaces not supported by C++ are in bold.
>>>>>> That list poses some challenges for C++ and implementations.
>>>>>>
>>>>>> These additional whitespaces are not in the Basic Latin block, which
>>>>>> would require implementations to expect arbitrary Unicode in places where
>>>>>> they might not currently.
>>>>>> I am not sure that the cost/benefit ratio justifies adding these
>>>>>> characters.
>>>>>>
>>>>>> Furthermore, I think it would be ill-advised to consider LRM and RLM
>>>>>> in C++ as these change
>>>>>> the directionality of text, which, as sensible as it is in
>>>>>> multilingual prose, poses interesting challenges in C++, challenges which
>>>>>> have already been discussed in the context of UAX31.
>>>>>>
>>>>>> NEL is of course used by EBCDIC but could be mapped in phase 1 to LF,
>>>>>> as is recommended by UTF-EBCDIC.
>>>>>>
>>>>>> As such, I do not think extending the set of new lines and
>>>>>> whitespaces has much value.
>>>>>>
>>>>>> *New lines*
>>>>>>
>>>>>> There is, however, a catch there. There always is.
>>>>>> The mapping of a new line character to any other new line character
>>>>>> is not observable, except for
>>>>>> the purpose of raw-string literals.
>>>>>>
>>>>>> Which is the subject of CWG-1655
>>>>>> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>>>>>>
>>>>>> I believe that, from the user's perspective, it is reasonable that
>>>>>> raw strings use
>>>>>> the line terminator appropriate for the target platform.
>>>>>> It's also in line with the ideas that non-visible characters should
>>>>>> not impact the semantics of programs and that source code should be
>>>>>> portable.
>>>>>>
>>>>>> I believe the following mechanism would provide the desirable
>>>>>> observable behavior:
>>>>>>
>>>>>> 1/ In phase 1 or 2, after transcoding to Unicode, replace any
>>>>>> new-line sequence (CR, LF, NEL, CRLF) by LF (in the same way all whitespaces
>>>>>> and comments are replaced by SPACE in phase 3)
>>>>>>
>>>>>> 2/ define new-line to be an implementation-defined sequence of
>>>>>> abstract characters representable in the literal and wide literal encodings,
>>>>>> (for the benefit of escape-sequences, raw strings and chrono)
>>>>>>
>>>>>> 3/ In phase 5, before converting to the execution encoding, replace
>>>>>> each LF by a new-line in raw string literals
>>>>>>
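The three steps above could be sketched roughly as follows. This is only an
illustration on UTF-32 strings, not proposed wording and not how any
compiler does it. `normalize_new_lines`, `expand_new_lines` and `example`
are hypothetical names, and the CR LF target new-line used in `example` is
an assumption made for the sake of the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr char32_t LF = 0x000A;
constexpr char32_t CR = 0x000D;
constexpr char32_t NEL = 0x0085;

// Step 1: replace every new-line sequence (CR, LF, NEL, CRLF) by one LF.
std::u32string normalize_new_lines(std::u32string_view src) {
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == CR) {
            // CRLF is a single new-line sequence: skip the LF that follows.
            if (i + 1 < src.size() && src[i + 1] == LF) ++i;
            out.push_back(LF);
        } else if (c == NEL) {
            out.push_back(LF);
        } else {
            out.push_back(c);  // LF and everything else pass through.
        }
    }
    return out;
}

// Steps 2 and 3: in the body of a raw string literal, replace each LF by
// the implementation-defined new-line sequence passed as `new_line`.
std::u32string expand_new_lines(std::u32string_view raw_body,
                                std::u32string_view new_line) {
    std::u32string out;
    for (const char32_t c : raw_body) {
        if (c == LF) out.append(new_line);
        else out.push_back(c);
    }
    return out;
}

// Usage sketch: mixed line endings in the source, and a target whose
// new-line is assumed (hypothetically) to be CR LF.
inline void example() {
    const std::u32string normalized = normalize_new_lines(U"a\r\nb\rc\nd");
    assert(normalized == U"a\nb\nc\nd");
    assert(expand_new_lines(normalized, U"\r\n") == U"a\r\nb\r\nc\r\nd");
}
```

With this split, the line endings of the source file never reach the
program, and only the target's new-line shows up in raw strings.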
>>>>>> The good news is that we can improve all of that without going to EWG
>>>>>>
>>>>>> *Corentin*
>>>>>>
>>>>>>
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>
>>>>>

Received on 2021-03-25 16:29:05