Re: [SG16] Whitespaces again

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 22 Sep 2021 17:46:24 +0200
On Wed, Sep 22, 2021 at 4:44 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Wed, Sep 22, 2021 at 7:21 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>> Thanks for the feedback
>> New draft https://isocpp.org/files/papers/D2348R2.pdf
>>
>
> Thanks.
>
>
>>
>> On Wed, Sep 22, 2021 at 6:54 AM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>>
>>> In [lex.whitespaces]:
>>> A disambiguation rule is required to prefer matching CRLF as a
>>> *line-break* instead of two *line-break*s.
>>>
>>
>> See Jens' reply, which I agree with
>>
>
> Although the original formulation I gave for the "each non-whitespace
> character" preprocessor token case covered this for everything other than
> the raw string literal case, you still need a disambiguation rule to manage
> the raw string literal case.
>
>
>>
>>>
>>> In [lex.string], the "*line-break*" in a raw string literal wording
>>> could be more explicit about scanning for line-breaks (a sequence matching a
>>> *line-break* is not a *line-break* "for free"; it is a *line-break* if,
>>> for example, the grammar asks for a *line-break*).
>>> This can be done by adding *line-break* under the *r-char* grammar and
>>> adjusting the other *r-char* case with the formula from
>>> *single-line-comment-elem*.
>>>
>>
>> I am not sure I agree with all of that, but I do agree that there is
>> no line-break as such in a raw string literal.
>> I've replaced it with " A sequence of characters that matches the grammar
>> of line-break ..."
>>
>
> That approach is fine. Still need to watch out for the CRLF versus CR + LF
> ambiguity.
>

I added "A whitespace is the longest sequence of characters that could
constitute a whitespace." in [lex.whitespace]. I believe this holds for
all sorts of whitespace, including comments, and it takes care of both
line-breaks and comments.

https://isocpp.org/files/papers/D2348R2.pdf
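
To illustrate the longest-match rule for the CRLF case, here is a minimal
sketch in C++. It is not wording from the paper: the helper names
line_break_length and count_line_breaks are hypothetical, and it assumes
only CR, LF and CRLF as line-break sequences (the paper's set may be
larger). The point is that CR followed by LF is consumed as one line-break,
never as two.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: length of the line-break that starts at s[i],
// or 0 if no line-break starts there.
// Assumption: only CR, LF and CRLF are treated as line-breaks here.
std::size_t line_break_length(std::string_view s, std::size_t i) {
    if (i >= s.size()) return 0;
    if (s[i] == '\r') {
        // Longest match: prefer CRLF (length 2) over a lone CR (length 1).
        return (i + 1 < s.size() && s[i + 1] == '\n') ? 2 : 1;
    }
    return s[i] == '\n' ? 1 : 0;
}

// Counts line-breaks by always consuming the longest match at each position.
std::size_t count_line_breaks(std::string_view s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = line_break_length(s, i);
        if (n != 0) {
            ++count;
            i += n;  // skip the whole line-break sequence
        } else {
            ++i;     // ordinary character
        }
    }
    return count;
}
```

With this rule, "a\r\nb" contains one line-break, while "a\r\rb" and
"a\n\rb" each contain two.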


>
>>
>>
>>> In [lex.pptoken]:
>>> The instances of "non-whitespace character" with respect to the "cannot
>>> be one of the above" case is problematic if the interpretation leaves us
>>> with cases where there are Unicode whitespace characters that are a part of
>>> neither a preprocessing token nor a *whitespace*. That's a new
>>> situation, which the surrounding wording could not be relied upon to handle
>>> in a straightforward manner.
>>>
>>
>> Changed to "each character that is not part of a whitespace and that
>> cannot be one of the above"
>> Might need further massaging when merging
>>
>
> The "is not part of a *whitespace*" would only be okay if there was a
> rule elsewhere to prefer interpreting comments as comments. It seems that
> rule is missing. The "cannot be" formulation works the same way as the
> existing "cannot be one of the above": it indicates the preference in
> interpretation and avoids the question of whether a character has a
> certain property while the determination of whether it has that property is
> in play.
>

Fixed

>
>
>>
>>
>>>
>>> This could be fixed by replacing:
>>> each non-whitespace character that cannot be one of the above
>>> =>
>>> each character that cannot be considered part of a *whitespace* and
>>> cannot be one of the above
>>>
>>> This also happens to fix a pre-existing issue that the wording is rather
>>> weak on preferring to interpret comments as comments.
>>>
>>

Received on 2021-09-22 10:46:37