Re: US 3-030: New-line character sequences in UTF-8 source files

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 8 Nov 2022 02:00:29 +0100
P2348 very intentionally supports CR, LF, and CR+LF as new-lines.
These are the new-line sequences used on common platforms. I don't know how
common a lone CR is in UTF-8 files (probably uncommon), but we should not tie
which code points are line breaks to the encoding/UTF-8-ness, even in the
presence of an implementation-defined mapping.

LF+CR is an oddity that is no longer relevant: the platforms that used it are
not supported by C++ and/or have not been supported since long before C++ was
a thing. Implementations do not consider that sequence to be a single line
break, and we probably don't want to force that.
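
To make the rule concrete, here is a minimal sketch of a line splitter that treats LF, CR, and CR+LF each as exactly one line break, so that LF+CR ends up as two breaks. This is an illustration only: `split_lines` is a hypothetical helper, not wording from P2348 and not how any particular compiler is implemented.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P2348 or any implementation):
// splits `src` into lines, treating LF, CR, and CR+LF each as one line break.
// LF+CR is not special-cased, so it counts as two consecutive line breaks.
std::vector<std::string_view> split_lines(std::string_view src) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '\n' || c == '\r') {
            lines.push_back(src.substr(start, i - start));
            // A CR immediately followed by LF is consumed as a single break.
            if (c == '\r' && i + 1 < src.size() && src[i + 1] == '\n') {
                ++i;
            }
            ++i;
            start = i;
        } else {
            ++i;
        }
    }
    // Keep a final line that is not terminated by a line break.
    if (start < src.size()) {
        lines.push_back(src.substr(start));
    }
    return lines;
}
```

With this sketch, "a\r\nb" yields two lines, while "a\n\rb" yields three (the middle one empty), which matches the behavior described above for LF+CR.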

P2348 intends to reflect standard practice (and standards).
It is true that CR is not discussed in the paper. But we should note that both
ASCII and Unicode, as well as existing implementations, other tools, etc.,
treat CR as a line break.
Why be inventive?





On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> CWG reviewed US-3-030 <https://github.com/cplusplus/nbballot/issues/475>
> today. Minutes here
> <https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.
>
> I didn't schedule this issue for discussion in SG16 because I didn't think
> there was anything interesting for SG16 to weigh in on. However, the CWG
> discussion turned out to be more interesting than I anticipated. I expect
> that the resolution direction described below has SG16 consensus, but am
> sending this message to provide an opportunity to object if anyone has
> concerns.
>
> During the discussion, I noted that one of the primary goals of P2295
> (Support for UTF-8 as a portable source file encoding)
> <https://wg21.link/p2295> was to ensure a portable source file. For a
> source file to be portable, new-line character sequences must be portably
> recognized as such. The change proposed with the NB comment left the set of
> character sequences that constitute a new-line unspecified. I expressed a
> desire to specify which character sequences constitute a new-line. We then
> discussed which sequences should be recognized and settled on LF and CR+LF.
> Support for CR on its own was discussed, but it was felt more evidence and
> motivation should be provided for that case.
>
> The direction to specify LF and CR+LF as new-line character sequences was
> believed to be consistent with P2348 (Whitespaces Wording Revamp)
> <https://wg21.link/p2348> which both SG16 and EWG have previously
> approved (see polling records in the corresponding GitHub issue
> <https://github.com/cplusplus/papers/issues/1027>). However, upon
> reviewing the wording, it looks to me that P2348 does permit CR by itself
> to constitute a new-line (see the proposed grammar additions for
> *line-break* in [lex.whitespaces]). That seems intentional, but isn't
> discussed in the paper, so I'm not quite sure (the paper does discuss LF+CR
> but stops short of proposing support for it).
>
> So, if anyone strongly feels that a lone CR in a UTF-8 source file should
> be considered a new-line in portable source files, please respond. Please
> note that implementors can support whatever new-line character sequences
> desired under the "For any other kind of input file supported by the
> implementation ..." part of [lex.phases]p1
> <http://eel.is/c++draft/lex.phases#1.1>.
>
> Note that the choices made to resolve this issue might require
> implementations to make changes (e.g., to recognize new-line sequences that
> they don't today).
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-11-08 01:00:43