On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot <corentinjabot@gmail.com> wrote:

P2348 very intentionally supports CR, LF, CR+LF as new lines.
These are the new-line sequences used on common platforms. I don't know how common it is to have a lone CR in a UTF-8 file (probably uncommon), but we should not tie which code points are line-breaking to the encoding or UTF-8-ness, or even to the presence of an implementation-defined mapping.

LF+CR is an oddity: it is no longer relevant on platforms supported by C++, and/or the platforms that used it have not been supported since long before C++ was a thing.

Implementations do not consider that sequence to be a single line break, and we probably don't want to force that.
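
To make the distinction concrete, here is a minimal C++ sketch of a scanner that treats CR, LF, and CR+LF as one line break each, so that LF+CR naturally comes out as two breaks. The function name count_line_breaks is hypothetical and used only for illustration; it is not taken from P2348 or from any implementation, and it ignores everything else a real lexer would have to handle.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper for illustration only (not from P2348 or any compiler).
// Counts line breaks, treating CR, LF, and CR+LF as one break each.
std::size_t count_line_breaks(std::string_view text) {
    std::size_t breaks = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            // CR+LF is a single break: consume the LF that follows the CR.
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
            ++breaks;
        } else if (text[i] == '\n') {
            // A lone LF, including the LF that starts an LF+CR pair.
            ++breaks;
        }
    }
    return breaks;
}
```

With this rule, "a\r\nb" contains one line break, while "a\n\rb" contains two: the LF ends one line and the following CR ends an empty one.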

P2348 intends to reflect standard practices (and standards).

It is true that CR is not discussed. But we should note that ASCII and Unicode, as well as existing implementations and other tools, treat CR as a line break.

Why be inventive?

On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

CWG reviewed US-3-030 today. Minutes here.

I didn't schedule this issue for discussion in SG16 because I didn't think there was anything interesting for SG16 to weigh in on. However, the CWG discussion turned out to be more interesting than I anticipated. I expect that the resolution direction described below has SG16 consensus, but am sending this message to provide an opportunity to object if anyone has concerns.

During the discussion, I noted that one of the primary goals of P2295 (Support for UTF-8 as a portable source file encoding) was to ensure a portable source file. For a source file to be portable, new-line character sequences must be portably recognized as such. The change proposed with the NB comment left the set of character sequences that constitute a new-line unspecified. I expressed a desire to specify which character sequences constitute a new-line. We then discussed which sequences should be recognized and settled on LF and CR+LF. Support for CR on its own was discussed, but it was felt more evidence and motivation should be provided for that case.

The direction to specify LF and CR+LF as new-line character sequences was believed to be consistent with P2348 (Whitespaces Wording Revamp), which both SG16 and EWG have previously approved (see polling records in the corresponding GitHub issue). However, upon reviewing the wording, it looks to me like P2348 does permit CR by itself to constitute a new-line (see the proposed grammar additions for line-break in [lex.whitespaces]). That seems intentional, but it isn't discussed in the paper, so I'm not quite sure (the paper does discuss LF+CR but stops short of proposing support for it).

So, if anyone strongly feels that a lone CR in a UTF-8 source file should be considered a new-line in portable source files, please respond. Please note that implementors can support whatever new-line character sequences desired under the "For any other kind of input file supported by the implementation ..." part of [lex.phases]p1.

Note that the choices made to resolve this issue might require implementations to make changes (e.g., to recognize new-line sequences that they don't today).

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16