sg16: Re: [SG16] On whitespaces and new-line

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 25 Mar 2021 21:05:44 +0100

On Thu, Mar 25, 2021 at 8:46 PM Steve Downey <sdowney_at_[hidden]> wrote:

> From https://www.unicode.org/reports/tr44/tr44-26.html#White_Space -
> Unicode® Standard Annex #44 UNICODE CHARACTER DATABASE
> White_Space
> <https://www.unicode.org/reports/tr44/tr44-26.html#White_Space> B N Spaces,
> separator characters and other control characters which should be treated
> by programming languages as "white space" for the purpose of parsing
> elements. See also Line_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Line_Break>,
> Grapheme_Cluster_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Grapheme_Cluster_Break>
> , Sentence_Break
> <https://www.unicode.org/reports/tr44/tr44-26.html#Sentence_Break>, and
> Word_Break <https://www.unicode.org/reports/tr44/tr44-26.html#Word_Break>,
> which classify space characters and related controls somewhat differently
> for particular text segmentation contexts.
>
> And from PropList.txt, where the White_Space binary property lives
> https://www.unicode.org/Public/13.0.0/ucd/PropList.txt
>
>
>
> 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
> 0020 ; White_Space # Zs SPACE
> 0085 ; White_Space # Cc <control-0085>
> 00A0 ; White_Space # Zs NO-BREAK SPACE
> 1680 ; White_Space # Zs OGHAM SPACE MARK
> 2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
> 2028 ; White_Space # Zl LINE SEPARATOR
> 2029 ; White_Space # Zp PARAGRAPH SEPARATOR
> 202F ; White_Space # Zs NARROW NO-BREAK SPACE
> 205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
> 3000 ; White_Space # Zs IDEOGRAPHIC SPACE
>
>
I should have clarified that the list I am using is Pattern_White_Space
https://unicode.org/reports/tr31/#R3

The list is in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
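
To make the difference from the White_Space list quoted above concrete, here is a minimal sketch of a Pattern_White_Space test. The function names are hypothetical, and the code points are the ones listed in this thread (TAB, LF, VT, FF, CR, SPACE, NEL, LRM, RLM, LS, PS). The second predicate reflects my reading of which of those C++ accepts today (the entries not in bold in my original mail); it is not wording from the standard.

```cpp
// Hypothetical helper: true for the code points with the
// Pattern_White_Space property, as enumerated in this thread.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // TAB, LF, VT, FF, CR
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // NEL
        || c == 0x200E || c == 0x200F     // LRM, RLM
        || c == 0x2028 || c == 0x2029;    // LS, PS
}

// Hypothetical helper: the subset that is not in bold in the original
// mail, i.e. the whitespace and new-line characters C++ already handles.
constexpr bool is_currently_supported_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020;
}
```

Note that U+00A0 NO-BREAK SPACE and the U+2000..U+200A spaces have the White_Space property but not Pattern_White_Space, so the first predicate rejects them.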

>
> New-line is a bit more complicated because in some contexts it's a line
> break in source, however that is designated, and other times it is exactly
> the control character '\n', whatever the value of that is.
>
> Raw string literals make this visible, and there's a note that says that
> line breaks in source are to be encoded as \n in the execution string.
>
> On Thu, Mar 25, 2021 at 9:46 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>> Hello,
>>
>> As indicated in the telecon, here is a mail full of whitespaces and line
>> breaks.
>>
>> The issues with whitespaces and line breaks are multiple.
>>
>> *Wording:*
>>
>> - We are not consistent about the spelling of whitespace - Editorial PR
>> https://github.com/cplusplus/draft/pull/4557
>> - We are not consistent about using "whitespace character" or just
>> "whitespace". I believe the solution here would be to make whitespace a
>> grammar term
>> - We should use the Unicode name, in upper case, to spell the various
>> whitespaces when they are mentioned
>> - new-line is sometimes a grammar term, sometimes not
>>
>> I believe the solution for all of these issues is to introduce and use
>> grammar terms for both new-line and whitespaces
>>
>> *Unicode whitespaces and newlines*
>>
>> The list of new lines is as follows
>>
>> LF: Line Feed, U+000A
>> VT: Vertical Tab, U+000B
>> FF: Form Feed, U+000C
>> CR: Carriage Return, U+000D
>> CR+LF: CR (U+000D) followed by LF (U+000A)
>>
>>
>> *NEL: Next Line, U+0085*
>> *LS: Line Separator, U+2028*
>> *PS: Paragraph Separator, U+2029*
>>
>> The list of additional whitespaces is as follows
>>
>> U+0009 HORIZONTAL TAB
>> U+0020 SPACE
>>
>> *U+200E LEFT-TO-RIGHT MARK*
>> *U+200F RIGHT-TO-LEFT MARK*
>>
>> The whitespaces not supported by C++ are in bold.
>> That list poses some challenges for C++ and implementations
>>
>> These additional whitespaces are not in the Basic Latin block, which
>> would require implementations to expect arbitrary Unicode in places where
>> they might not currently.
>> I am not sure that the cost/benefit ratio justifies adding these
>> characters.
>>
>> Furthermore, I think it would be ill-advised to consider LRM and RLM in
>> C++, as these change the directionality of text. That is sensible in
>> multilingual prose, but it poses interesting challenges in C++, challenges
>> which have already been discussed in the context of UAX31.
>>
>> NEL is of course used by EBCDIC, but it could be mapped to LF in phase 1,
>> as is recommended by UTF-EBCDIC.
>>
>> As such, I do not think extending the set of new lines and whitespaces
>> has much value.
>>
>> *New lines*
>>
>> There is, however, a catch there. There always is.
>> The mapping of a new line character to any other new line character is
>> not observable, except for
>> the purpose of raw-string literals.
>>
>> Which is the subject of CWG-1655
>> http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1655
>>
>> I believe that, from the user's perspective, it is reasonable that
>> raw strings use
>> the line terminator appropriate for the target platform.
>> It's also in line with the ideas that non-visible characters should not
>> impact the semantics of programs and that source code should be portable.
>>
>> I believe the following mechanism would provide the desirable observable
>> behavior:
>>
>> 1/ In phase 1 or 2, after transcoding to Unicode, replace any new-line
>> sequence (CR, LF, NEL, CRLF) by LF (in the same way comments are
>> replaced by SPACE in phase 3)
>>
>> 2/ Define new-line to be an implementation-defined sequence of abstract
>> characters representable in the literal and wide literal encodings (for the
>> benefit of escape-sequences, raw strings and chrono)
>>
>> 3/ In phase 5, before converting to the execution encoding, replace each
>> LF by a new-line in raw string literals
>>
>> The good news is that we can improve all of that without going to EWG.
>>
>> *Corentin*
>>
>
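
The three-step mechanism in the quoted mail (normalize every new-line sequence to LF early, treat new-line as an implementation-defined sequence, then re-expand LF inside raw string literals in phase 5) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the source is assumed to be already transcoded to UTF-32, and the target new-line sequence is passed in as a parameter to stand in for the implementation-defined choice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Step 1 (hypothetical helper): after transcoding to Unicode, replace
// each new-line sequence (CR LF, CR, LF, NEL) by a single LF.
std::u32string normalize_new_lines(const std::u32string& source) {
    std::u32string out;
    out.reserve(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        const char32_t c = source[i];
        if (c == 0x000D) {
            // CR, optionally followed by LF, collapses to one LF.
            if (i + 1 < source.size() && source[i + 1] == 0x000A) {
                ++i;
            }
            out.push_back(0x000A);
        } else if (c == 0x0085) {
            out.push_back(0x000A);  // NEL becomes LF
        } else {
            out.push_back(c);       // LF and everything else pass through
        }
    }
    return out;
}

// Steps 2 and 3 (hypothetical helper): in phase 5, replace each LF in the
// body of a raw string literal by the implementation-defined new-line
// sequence, modelled here by the `new_line` parameter.
std::u32string apply_target_new_line(const std::u32string& raw_literal_body,
                                     const std::u32string& new_line) {
    assert(!new_line.empty());  // assumption: new-line is never empty
    std::u32string out;
    for (const char32_t c : raw_literal_body) {
        if (c == 0x000A) {
            out += new_line;
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With this split, the encoding of line breaks in the source file never reaches the program. Only the target's choice of new-line does, which is the behavior the mail argues for in the context of CWG-1655.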

Received on 2021-03-25 15:05:58