US 3-030: New-line character sequences in UTF-8 source files

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 7 Nov 2022 19:37:36 -0500
CWG reviewed US-3-030 <https://github.com/cplusplus/nbballot/issues/475>
today. Minutes here
<https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.

I didn't schedule this issue for discussion in SG16 because I didn't
think there was anything interesting for SG16 to weigh in on. However,
the CWG discussion turned out to be more interesting than I anticipated.
I expect that the resolution direction described below has SG16
consensus, but am sending this message to provide an opportunity to
object if anyone has concerns.

During the discussion, I noted that one of the primary goals of P2295
(Support for UTF-8 as a portable source file encoding)
<https://wg21.link/p2295> was to ensure a portable source file. For a
source file to be portable, new-line character sequences must be
portably recognized as such. The change proposed with the NB comment
left the set of character sequences that constitute a new-line
unspecified. I expressed a desire to specify which character sequences
constitute a new-line. We then discussed which sequences should be
recognized and settled on LF and CR+LF. Support for CR on its own was
discussed, but it was felt more evidence and motivation should be
provided for that case.
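
To make the direction concrete, here is a minimal sketch of a scanner that treats only LF and CR+LF as new-lines in a UTF-8 buffer. The function name count_new_lines is hypothetical and is not taken from P2295, the NB comment, or any implementation; it only illustrates the behavior discussed above, including that a lone CR is not counted.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts new-line sequences in a UTF-8 buffer,
// recognizing only LF and CR+LF (the direction discussed in CWG).
constexpr std::size_t count_new_lines(std::string_view src) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\n') {
            ++count;  // lone LF
        } else if (src[i] == '\r' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++count;  // CR+LF counts as a single new-line
            ++i;      // skip the LF that was just consumed
        }
        // A lone CR falls through and is not treated as a new-line here.
    }
    return count;
}
```

Under this sketch, count_new_lines("a\nb\r\nc") yields 2 and count_new_lines("a\rb") yields 0. Because LF and CR are single bytes that never occur inside a multi-byte UTF-8 sequence, a byte-wise scan is sufficient.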

The direction to specify LF and CR+LF as new-line character sequences
was believed to be consistent with P2348 (Whitespaces Wording Revamp)
<https://wg21.link/p2348> which both SG16 and EWG have previously
approved (see polling records in the corresponding GitHub issue
<https://github.com/cplusplus/papers/issues/1027>). However, upon
reviewing the wording, it looks to me that P2348 does permit CR by
itself to constitute a new-line (see the proposed grammar additions for
/line-break/ in [lex.whitespaces]). That seems intentional, but isn't
discussed in the paper, so I'm not quite sure (the paper does discuss
LF+CR but stops short of proposing support for it).
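
For contrast, the following sketch shows what a scanner might look like if a lone CR were also accepted as a line break, which is how the P2348 wording reads to me. This is an assumption about the intent, not a transcription of the paper's grammar, and the name count_line_breaks_with_cr is hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts line breaks where LF, CR+LF, and a lone CR
// are each accepted. This reflects one reading of P2348, not its exact grammar.
constexpr std::size_t count_line_breaks_with_cr(std::string_view src) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\r') {
            ++count;  // CR, alone or as the start of CR+LF
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                ++i;  // CR+LF is one line break, so consume the LF too
            }
        } else if (src[i] == '\n') {
            ++count;  // lone LF
        }
    }
    return count;
}
```

Note that under this sketch LF+CR is not a single sequence: "a\n\rb" contains two line breaks (an LF followed by a lone CR), whereas "a\r\nb" contains one.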

So, if anyone strongly feels that a lone CR in a UTF-8 source file
should be considered a new-line in portable source files, please
respond. Please note that implementors can support whatever new-line
character sequences they desire under the "For any other kind of input
file supported by the implementation ..." part of [lex.phases]p1
<http://eel.is/c++draft/lex.phases#1.1>.

Note that the choices made to resolve this issue might require
implementations to make changes (e.g., to recognize new-line sequences
that they don't today).

Tom.

Received on 2022-11-08 00:37:38