sg16: Re: [SG16] On whitespaces and new-line

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 26 Mar 2021 12:00:55 +0100

On Fri, Mar 26, 2021, 05:31 Tom Honermann <tom_at_[hidden]> wrote:

> On 3/25/21 4:46 PM, Steve Downey wrote:
>
>
>
> On Thu, Mar 25, 2021 at 4:34 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> There are two CWG issues related to this:
>>
>> - CWG #1655: Line endings in raw string literals
>> <https://wg21.link/cwg1655>
>>
> It looks like the current text addresses this. From 1655 "Is it intended
> that, for example, a CRLF in the source of a raw string literal is to be
> represented as a newline character or as the original characters?"
> From the current draft: http://eel.is/c++draft/lex.string#4
> [*Note 2 <http://eel.is/c++draft/full#lex.string-note-2>*:
> A source-file new-line in a raw string literal results in a new-line in
> the resulting execution string literal.
> <http://eel.is/c++draft/full#lex.string-4.sentence-1>
>
> Assuming no whitespace at the beginning of lines in the following example,
> the assert will succeed:
>
> const char* p = R"(a\
> b
> c)";
> assert(std::strcmp(p, "a\\\nb\nc") == 0);
> — *end note*]
>
> It does look that way, but this wording also predates the core issue.
> Perhaps that wording went unnoticed when the issue was recorded. I did
> some brief tests and, assuming I didn't mess it up, it looks like the
> assert passes for gcc (9.1), Clang (7.0), and Visual C++ (2019 16.7) when
> presented with a source file with CRLF line endings. Perhaps this issue can be
> resolved as NAD. I sent a message to CWG.
>
The issue is that there are two different entities, neither of which is ever
quite described.
"(introducing new-line characters for end-of-line indicators)" doesn't help.

Reasonably, the intent (and this follows Unicode recommendations) is that
what the wording calls "source-file new-line" (which is both undescribed
and presumably incorrect, as the source file is mapped in phase 1)
is any line-breaking character (in our case LF, CR, NEL...), while the second
new-line refers to a specific, implementation-defined sequence of characters
representing a line break in the associated string encoding.

I'll take (another) crack at this issue.

I'd like some opinions.
I believe there are 2 options in terms of wording, both mechanisms being
indistinguishable from each other.

1/ Specify that a new-line is a specific set of character sequences (LF,
CRLF, CR, NEL) and make it a grammar element, which is then used in [lex]
and [cpp] where *new-line* and new-line are currently mentioned.
2/ Specify that in phase 1 line terminators are replaced by LF, and replace
all mentions of new-line pertaining to lexing by LINE FEED (but not in
evaluated raw string literals).
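
To make option 2 concrete, here is a rough sketch of what the phase 1
replacement could look like. This is only an illustration, not proposed
wording: the function name normalize_line_terminators is made up, the input
is assumed to be already decoded to code points, and only the four
terminators listed in option 1 (CRLF, CR, LF, NEL) are handled. It does not
try to model the raw string literal exception.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from any proposal): replaces each line
// terminator with a single LINE FEED, as option 2 would do in phase 1.
// Assumption: the terminator set is exactly CR LF, CR, LF and NEL (U+0085).
std::u32string normalize_line_terminators(std::u32string_view src) {
    constexpr char32_t LF = U'\n';
    constexpr char32_t CR = U'\r';
    constexpr char32_t NEL = U'\x85';

    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char32_t c = src[i];
        if (c == CR) {
            // CR LF is one terminator: consume the LF that follows the CR.
            if (i + 1 < src.size() && src[i + 1] == LF) {
                ++i;
            }
            out.push_back(LF);
        } else if (c == NEL) {
            out.push_back(LF);
        } else {
            // LF and all other characters pass through unchanged.
            out.push_back(c);
        }
    }
    return out;
}
```

With something like this in phase 1, later phases would only ever see
LINE FEED. Option 1 would instead keep the original sequences and match them
in the grammar, which should be indistinguishable for a conforming program.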

I don't know if I have a preference.

In any case, I think we want to specify what a _whitespace_ is as a grammar
element and replace all mentions of whitespace, whitespaces, and whitespace
characters by *whitespace*.
For simplicity, it's probably useful to define *horizontal-whitespace*
and *whitespace*, maybe in [lex.token].

*horizontal-whitespace*:
    SPACE
    HORIZONTAL TAB

*whitespace*:
    *horizontal-whitespace*
    LINE FEED
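
As a quick sanity check of what these two productions would accept, they can
be written as predicates. This is only a sketch: the function names are made
up, and it assumes line terminators were already normalized to LINE FEED in
phase 1, so CR and NEL are deliberately not treated as whitespace here.

```cpp
// Hypothetical predicates mirroring the two productions as listed:
// horizontal-whitespace is SPACE or HORIZONTAL TAB, and whitespace is
// horizontal-whitespace or LINE FEED.
constexpr bool is_horizontal_whitespace(char32_t c) {
    return c == U' ' || c == U'\t';
}

constexpr bool is_whitespace(char32_t c) {
    return is_horizontal_whitespace(c) || c == U'\n';
}

static_assert(is_horizontal_whitespace(U' '), "SPACE is horizontal");
static_assert(!is_horizontal_whitespace(U'\n'), "LINE FEED is not horizontal");
static_assert(is_whitespace(U'\n'), "LINE FEED is whitespace");
```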

If we want to keep exact line terminators in phase 1, we can do the same
for new-line (note, there is currently a grammar production for new-line in
[cpp]: *new-line*: the new-line character).

We could simplify further by adding comments to whitespace, but there is
no grammar for that :(

> Tom.
>

Received on 2021-03-26 06:01:08