Subject: Re: On whitespaces and new-line
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-03-26 06:00:55
On Fri, Mar 26, 2021, 05:31 Tom Honermann <tom_at_[hidden]> wrote:
> On 3/25/21 4:46 PM, Steve Downey wrote:
> On Thu, Mar 25, 2021 at 4:34 PM Tom Honermann <tom_at_[hidden]> wrote:
>> There are two CWG issues related to this:
>> - CWG #1655: Line endings in raw string literals
>> It looks like the current text addresses this. From 1655 "Is it intended
> that, for example, a CRLF in the source of a raw string literal is to be
> represented as a newline character or as the original characters?"
> From the current draft: http://eel.is/c++draft/lex.string#4
> [*Note 2 <http://eel.is/c++draft/full#lex.string-note-2>*:
> A source-file new-line in a raw string literal results in a new-line in
> the resulting execution string literal.
> Assuming no whitespace at the beginning of lines in the following example,
> the assert will succeed:
> const char* p = R"(a\
> b
> c)";
> assert(std::strcmp(p, "a\\\nb\nc") == 0);
> -- *end note*]
> It does look that way, but this wording also predates the core issue.
> Perhaps that wording went unnoticed when the issue was recorded. I did
> some brief tests and, assuming I didn't mess it up, it looks like the
> assert passes for gcc (9.1), Clang (7.0), and Visual C++ (2019 16.7) when
> presented a source file with CRLF line endings. Perhaps this issue can be
> resolved as NAD. I sent a message to CWG.
The issue is that there are two different entities, neither of which is ever
defined. The phase 1 wording
"(introducing new-line characters for end-of-line indicators)" doesn't help.
Reasonably, the intent (and this follows Unicode recommendations) is that
what the wording calls "source-file new-line" (which is both undescribed
and presumably incorrect, as the source file is mapped in phase 1)
is any line-breaking character (in our case LF, CR, NEL...), while the second
new-line refers to a specific, implementation-defined sequence of characters
representing a line break in the associated string encoding.
I'll take (another) crack at this issue.
I'd like some opinions.
I believe there are 2 options in terms of wording - the two mechanisms being
indistinguishable from each other.
1/ Specify that a new-line is one of a specific set of character sequences
(LF, CRLF, CR, NEL) and make it a grammar element, which is then used in [lex]
and [cpp] where *new-line* and new-line are currently mentioned.
2/ Specify that in phase 1 line terminators are replaced by LF, and replace
all mentions of new-line pertaining to lexing by LINE FEED (but not
evaluated raw string literals).
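To illustrate why the two options should be indistinguishable, here is a rough sketch (not proposed wording, and not how any particular compiler does it). The function names are made up for this example, and it assumes the source has already been decoded to UTF-8, so NEL (U+0085) shows up as the bytes 0xC2 0x85. `new_line_length` plays the role of option 1 (a lexer that recognizes the set of sequences directly), and `normalize_to_lf` plays the role of option 2 (phase 1 replaces them with LF).

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Option 1 flavour (hypothetical helper): recognise a line terminator at
// `pos`. Returns its length in bytes, or 0 if there is none at that position.
// Assumption: UTF-8 input, so NEL (U+0085) is the two bytes 0xC2 0x85.
inline std::size_t new_line_length(std::string_view s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    if (s[pos] == '\r')  // CRLF counts as one terminator, lone CR as another
        return (pos + 1 < s.size() && s[pos + 1] == '\n') ? 2 : 1;
    if (s[pos] == '\n') return 1;
    if (s[pos] == '\xC2' && pos + 1 < s.size() && s[pos + 1] == '\x85')
        return 2;  // NEL
    return 0;
}

// Option 2 flavour (hypothetical helper): replace every line terminator
// with a single LF, as a phase 1 transformation might.
inline std::string normalize_to_lf(std::string_view s) {
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        if (std::size_t n = new_line_length(s, i)) {
            out += '\n';
            i += n;
        } else {
            out += s[i++];
        }
    }
    return out;
}

// Number of line breaks seen by a lexer that uses new_line_length directly.
inline std::size_t count_line_breaks(std::string_view s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        if (std::size_t n = new_line_length(s, i)) {
            ++count;
            i += n;
        } else {
            ++i;
        }
    }
    return count;
}
```

Either way the lexer sees the same number of line breaks in the same places; the only difference is whether the wording says "new-line" or "LINE FEED" afterwards.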
I don't know if I have a preference.
In any case, I think we want to specify what a _whitespace_ is as a grammar
element and replace all mentions of whitespace, whitespaces, and whitespace
characters by *whitespace*.
For simplicity, it's probably useful to define *horizontal-whitespace*,
maybe in [lex.token].
If we want to keep exact line terminators in phase 1, we can do the same
for new-line (note, there is currently a grammar production for new-line in
[cpp]: *new-line*: the new-line character)
We could simplify further by adding comments to whitespaces, but there is
no grammar for that :(
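As a strawman for what the two productions could cover, here is a small sketch. The exact membership is my assumption, not settled wording: *whitespace* is taken to be the characters [lex.token] lists today (space, horizontal tab, vertical tab, form feed, new-line), minus comments, and *horizontal-whitespace* is taken to be just space and horizontal tab. It also assumes line terminators have already been replaced by LF.

```cpp
// Hypothetical classifiers; the names mirror the grammar terms discussed
// above, but the character sets are assumptions made for illustration.

// Assumed *horizontal-whitespace*: space and horizontal tab only.
constexpr bool is_horizontal_whitespace(char c) {
    return c == ' ' || c == '\t';
}

// Assumed *whitespace*: horizontal whitespace plus vertical tab, form feed
// and new-line (LF, assuming line terminators were already normalized).
constexpr bool is_whitespace(char c) {
    return is_horizontal_whitespace(c) || c == '\v' || c == '\f' || c == '\n';
}

// Every horizontal-whitespace character is also whitespace; LF is
// whitespace but not horizontal.
static_assert(is_whitespace(' ') && is_horizontal_whitespace(' '));
static_assert(is_whitespace('\n') && !is_horizontal_whitespace('\n'));
```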