On Wed, Sep 22, 2021 at 4:44 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Wed, Sep 22, 2021 at 7:21 AM Corentin <corentin.jabot@gmail.com> wrote:
Thanks for the feedback

Thanks.
 

On Wed, Sep 22, 2021 at 6:54 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
 
In [lex.whitespaces]:
A disambiguation rule is required to prefer matching CRLF as a line-break instead of two line-breaks.

See Jens' reply, which I agree with

Although the original formulation I gave for the "each non-whitespace character" preprocessor token case covered this for everything other than the raw string literal case, you still need a disambiguation rule to manage the raw string literal case.

 

In [lex.string], the wording for "line-break" in a raw string literal could be more explicit about scanning for line-breaks (a sequence matching a line-break is not a line-break "for free"; it is a line-break if, for example, the grammar asks for a line-break).
This can be done by adding line-break under the r-char grammar and adjusting the other r-char case with the formula from single-line-comment-elem.

I am not sure I agree with all of that, but I do agree that there is no line-break as such in a raw string literal.
I've replaced it with " A sequence of characters that matches the grammar of line-break ..."

That approach is fine. Still need to watch out for the CRLF versus CR + LF ambiguity.

I added "A whitespace is the longest sequence of characters that could constitute a whitespace." in [lex.whitespace]. I believe this is true for all sorts of whitespace, including comments, and it takes care of both line-breaks and comments.
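
To make the CRLF point concrete, here is a minimal sketch of a longest-match scan. It is an illustration only, not wording from the paper: the helper name count_line_breaks is hypothetical, and it assumes a simplified line-break grammar with just three alternatives (LF, CR, and CR LF).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, for illustration only.
// Assumption: a line-break is one of LF, CR, or CR LF (simplified grammar).
// Longest match: a CR immediately followed by LF is one line-break, not two.
std::size_t count_line_breaks(std::string_view text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            // Prefer the longer alternative: consume the LF with the CR.
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
            ++count;
        } else if (text[i] == '\n') {
            ++count;
        }
    }
    return count;
}
```

Under this sketch, "a\r\nb" contains one line-break; without the longest-match preference the same input could be read as a CR line-break followed by an LF line-break, which is the ambiguity discussed above.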

https://isocpp.org/files/papers/D2348R2.pdf
 
 

In [lex.pptoken]:
The instances of "non-whitespace character" with respect to the "cannot be one of the above" case are problematic if the interpretation leaves us with cases where there are Unicode whitespace characters that are part of neither a preprocessing token nor a whitespace. That's a new situation, which the surrounding wording could not be relied upon to handle in a straightforward manner.

Changed to "each character that is not part of a whitespace and that cannot be one of the above"
Might need further massaging when merging

The "is not part of a whitespace" would only be okay if there was a rule elsewhere to prefer interpreting comments as comments. It seems that rule is missing. The "cannot be" formulation works the same way as the existing "cannot be one of the above": it indicates the preference in interpretation and avoids the question of whether a character has a certain property while the determination of whether it has that property is in play.

Fixed 
 
 

This could be fixed by replacing:
each non-whitespace character that cannot be one of the above
=>
each character that cannot be considered part of a whitespace and cannot be one of the above

This also happens to fix a pre-existing issue that the wording is rather weak on preferring to interpret comments as comments.
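
The preference order implied by the "cannot be" formulation can be sketched as a toy classifier. This is only an illustration under stated assumptions: the names Kind and classify_next are hypothetical, only a handful of operators stand in for "one of the above", and identifiers, numbers, and literals are omitted entirely.

```cpp
#include <string_view>

// Hypothetical categories, for illustration only.
enum class Kind { End, Whitespace, OpOrPunc, OtherChar };

// Classifies the next lexical element at the start of `rest`.
// Assumption: a toy grammar; real preprocessing tokens are far richer.
Kind classify_next(std::string_view rest) {
    if (rest.empty()) return Kind::End;

    // 1. Whitespace is tried first, comments included, so "//" and "/*"
    //    are read as the start of a comment rather than as a '/' token.
    if (rest.substr(0, 2) == "//" || rest.substr(0, 2) == "/*") {
        return Kind::Whitespace;
    }
    const char c = rest.front();
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\v' || c == '\f') {
        return Kind::Whitespace;
    }

    // 2. "One of the above": a tiny stand-in set of operators and punctuators.
    if (std::string_view("+-*/=;").find(c) != std::string_view::npos) {
        return Kind::OpOrPunc;
    }

    // 3. Catch-all: reached only when the character cannot be considered
    //    part of a whitespace and cannot be one of the above.
    return Kind::OtherChar;
}
```

The point of the ordering is that the catch-all never has to ask whether a character "is" whitespace; it is reached only after every other interpretation has been ruled out.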