Thanks, Corentin.

I agree that, if ~all existing implementations already treat a lone CR as a new-line, then we might as well standardize it. However, if some don't, then we'll be adding a (probably small) implementation burden for something that I suspect is rare. LF and CR+LF are common occurrences. Do you have data that shows that a lone CR 1) is recognized by ~all existing implementations, and 2) is used sufficiently often that it is worth standardizing? Do we want to encourage use of a lone CR as a portable new-line? As mentioned, implementations can still support it regardless. Unicode also recognizes U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR) as line-break characters.

I think it would be worth adding such analysis to a future revision of P2348.
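
As a rough illustration of how usage data of this kind could be gathered, the following is a minimal sketch that tallies LF, CR+LF, and lone CR occurrences in a buffer of source text. The names `NewLineCounts` and `count_new_lines` are hypothetical and appear in neither P2348 nor any implementation. The sketch assumes the input is UTF-8 (or ASCII), where the bytes 0x0D and 0x0A never occur inside a multi-byte sequence.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical tally of the new-line sequences found in a buffer.
struct NewLineCounts {
    std::size_t lf = 0;      // LF not preceded by CR
    std::size_t crlf = 0;    // CR immediately followed by LF
    std::size_t lone_cr = 0; // CR not followed by LF
};

// Byte-wise scan; assumes UTF-8 or ASCII input, where CR and LF
// cannot appear as part of a multi-byte encoded character.
NewLineCounts count_new_lines(std::string_view text) {
    NewLineCounts counts;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++counts.crlf;
                ++i; // consume the LF of the CR+LF pair
            } else {
                ++counts.lone_cr;
            }
        } else if (text[i] == '\n') {
            ++counts.lf;
        }
    }
    return counts;
}
```

Running something like this over a corpus of source files would show how often a lone CR actually appears, though it would say nothing about which implementations recognize it.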

In the interest of time, is anyone opposed to the CWG direction of requiring both LF and CR+LF in portable UTF-8 source files for C++23, with support for other new-line sequences left to a future standard?

Tom.

On 11/7/22 8:08 PM, Corentin Jabot wrote:
The design pillars of P2348:
  • Standardize existing practice
  • Standardize existing standards (i.e., Unicode, except when it conflicts with the first pillar; cf. the discussion on vertical tab in the paper)
  • Don't deviate from the status quo needlessly
  • Don't be inventive
  • Don't burden implementations gratuitously (hence why codepoints outside of basic latin 1 are not considered)

On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot <corentinjabot@gmail.com> wrote:
P2348 very intentionally supports CR, LF, and CR+LF as new-lines.
These are the new-line sequences used on common platforms. I don't know how common it is to have a lone CR in UTF-8 (probably uncommon), but we should not tie which codepoints are line-breaking to the encoding/UTF-8-ness, even in the presence of an implementation-defined mapping.

LF+CR is an oddity that is no longer relevant on platforms supported by C++; it belongs to platforms that have not been supported since long before C++ was a thing.
Implementations do not consider that sequence to be a single line break, and we probably don't want to force that.

P2348 intends to reflect standard practices (and standards).
It is true that CR is not discussed. But we should note that ASCII, Unicode, existing implementations, other tools, etc. all treat CR as a line break.
Why be inventive?





On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

CWG reviewed US-3-030 today. Minutes here.

I didn't schedule this issue for discussion in SG16 because I didn't think there was anything interesting for SG16 to weigh in on. However, the CWG discussion turned out to be more interesting than I anticipated. I expect that the resolution direction described below has SG16 consensus, but am sending this message to provide an opportunity to object if anyone has concerns.

During the discussion, I noted that one of the primary goals of P2295 (Support for UTF-8 as a portable source file encoding) was to ensure a portable source file. For a source file to be portable, new-line character sequences must be portably recognized as such. The change proposed with the NB comment left the set of character sequences that constitute a new-line unspecified. I expressed a desire to specify which character sequences constitute a new-line. We then discussed which sequences should be recognized and settled on LF and CR+LF. Support for CR on its own was discussed, but it was felt more evidence and motivation should be provided for that case.
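
To make the difference between the two options concrete, here is a minimal sketch of a line splitter that always recognizes LF and CR+LF and treats a lone CR as a new-line only when asked to. The `LoneCr` policy enum and the `split_lines` function are hypothetical names used for illustration; this is not how any particular compiler implements translation phase 1.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical policy switch: whether a lone CR ends a line.
enum class LoneCr { NotNewLine, NewLine };

// Splits src into lines. LF and CR+LF always end a line; a lone CR
// ends a line only under LoneCr::NewLine. The returned views refer
// to src and exclude the new-line sequences themselves.
std::vector<std::string_view> split_lines(std::string_view src, LoneCr policy) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    std::size_t i = 0;
    while (i < src.size()) {
        std::size_t len = 0; // length of the new-line sequence at i, 0 if none
        if (src[i] == '\n') {
            len = 1;
        } else if (src[i] == '\r') {
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                len = 2; // CR+LF is a single new-line
            } else if (policy == LoneCr::NewLine) {
                len = 1; // lone CR, recognized only by policy
            }
        }
        if (len == 0) {
            ++i;
            continue;
        }
        lines.push_back(src.substr(start, i - start));
        i += len;
        start = i;
    }
    if (start < src.size()) {
        lines.push_back(src.substr(start)); // unterminated last line
    }
    return lines;
}
```

With the input `"a\r\nb\rc\nd"`, the `NotNewLine` policy yields three lines (the second one contains an embedded CR), while the `NewLine` policy yields four. Under either policy, LF+CR is not treated as a single new-line.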

The direction to specify LF and CR+LF as new-line character sequences was believed to be consistent with P2348 (Whitespaces Wording Revamp) which both SG16 and EWG have previously approved (see polling records in the corresponding GitHub issue). However, upon reviewing the wording, it looks to me that P2348 does permit CR by itself to constitute a new-line (see the proposed grammar additions for line-break in [lex.whitespaces]). That seems intentional, but isn't discussed in the paper, so I'm not quite sure (the paper does discuss LF+CR but stops short of proposing support for it).

So, if anyone strongly feels that a lone CR in a UTF-8 source file should be considered a new-line in portable source files, please respond. Please note that implementors can support whatever new-line character sequences desired under the "For any other kind of input file supported by the implementation ..." part of [lex.phases]p1.

Note that the choices made to resolve this issue might require implementations to make changes (e.g., to recognize new-line sequences that they don't today).

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16