Re: [isocpp-core] US 3-030: New-line character sequences in UTF-8 source files

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 8 Nov 2022 00:43:51 -0500
On 11/7/22 8:58 PM, Hubert Tong wrote:
> On Mon, Nov 7, 2022 at 8:08 PM Corentin Jabot via Core
> <core_at_[hidden]> wrote:
> The design pillars of P2348
> * Standardize existing practice
> * Standardize existing standards (i.e., Unicode, except when it
> conflicts with 1, cf. the discussion on vertical tab in the paper)
> * Don't deviate from status quo needlessly
> * Don't be inventive
> * Don't burden implementations gratuitously (hence why
> codepoints outside of basic latin 1 are not considered)
> In the CWG discussion, it was noted that existing practice around the
> treatment of CR differs between platforms. My understanding is that
> lone CR is treated as \r in raw string literals by MSVC.

CWG1655 <https://wg21.link/cwg1655> and CWG1709
<https://wg21.link/cwg1709> are relevant here. I haven't verified MSVC's


> On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot
> <corentinjabot_at_[hidden]> wrote:
> P2348 very intentionally supports CR, LF, CR+LF as new lines.
> This is the set of new lines on common platforms. I don't
> know how common it is to have lone CR + UTF-8 (probably
> uncommon), but we should not tie
> which codepoints are breaking to encoding/UTF-8-ness, even in the
> presence of an implementation-defined mapping.
> LF+CR is an oddity: it is no longer relevant on platforms
> supported by C++, and/or the platforms that used it have not
> been supported since long before C++ was a thing.
> Implementations do not consider that sequence to be a single
> line break, and we probably don't want to force that.
> P2348 intends to reflect standard practices (and standards).
> It is true that CR is not discussed. But we should note that
> both ASCII and Unicode, and existing implementations, other
> tools, etc, will treat CR as a line break.
> Why be inventive?
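
To make the two positions in this thread concrete, here is a minimal sketch of a line splitter. It is not wording from P2348 or from the NB comment resolution; the names `LoneCR` and `split_lines` are hypothetical and exist only for illustration. LF and CR+LF always end a line; a lone CR ends a line only under the `LoneCR::LineBreak` policy, which roughly corresponds to the P2348 set (CR, LF, CR+LF), while `LoneCR::Ordinary` corresponds to recognizing only LF and CR+LF.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical policy switch, introduced only for this sketch.
enum class LoneCR { LineBreak, Ordinary };

// Splits `src` into lines. LF and CR+LF always terminate a line.
// A CR not followed by LF terminates a line only when
// `policy == LoneCR::LineBreak`; otherwise it stays in the line.
std::vector<std::string_view> split_lines(std::string_view src, LoneCR policy) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    std::size_t i = 0;
    while (i < src.size()) {
        std::size_t brk = 0;  // length of the line break at i, 0 if none
        if (src[i] == '\n') {
            brk = 1;
        } else if (src[i] == '\r') {
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                brk = 2;  // CR+LF is one line break
            } else if (policy == LoneCR::LineBreak) {
                brk = 1;  // lone CR
            }
        }
        if (brk != 0) {
            lines.push_back(src.substr(start, i - start));
            i += brk;
            start = i;
        } else {
            ++i;
        }
    }
    // Text after the last line break, if any, forms a final line.
    if (start < src.size()) {
        lines.push_back(src.substr(start));
    }
    return lines;
}
```

With this sketch, LF+CR is never a single line break: under `LoneCR::LineBreak` it counts as two breaks (LF, then CR), and under `LoneCR::Ordinary` the CR simply becomes part of the following line.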
> On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
> CWG reviewed US-3-030
> <https://github.com/cplusplus/nbballot/issues/475> today.
> Minutes here
> <https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.
> I didn't schedule this issue for discussion in SG16
> because I didn't think there was anything interesting for
> SG16 to weigh in on. However, the CWG discussion turned
> out to be more interesting than I anticipated. I expect
> that the resolution direction described below has SG16
> consensus, but am sending this message to provide an
> opportunity to object if anyone has concerns.
> During the discussion, I noted that one of the primary
> goals of P2295 (Support for UTF-8 as a portable source
> file encoding) <https://wg21.link/p2295> was to ensure a
> portable source file. For a source file to be portable,
> new-line character sequences must be portably recognized
> as such. The change proposed with the NB comment left the
> set of character sequences that constitute a new-line
> unspecified. I expressed a desire to specify which
> character sequences constitute a new-line. We then
> discussed which sequences should be recognized and settled
> on LF and CR+LF. Support for CR on its own was discussed,
> but it was felt more evidence and motivation should be
> provided for that case.
> The direction to specify LF and CR+LF as new-line
> character sequences was believed to be consistent with
> P2348 (Whitespaces Wording Revamp)
> <https://wg21.link/p2348> which both SG16 and EWG have
> previously approved (see polling records in the
> corresponding GitHub issue
> <https://github.com/cplusplus/papers/issues/1027>).
> However, upon reviewing the wording, it looks to me that
> P2348 does permit CR by itself to constitute a new-line
> (see the proposed grammar additions for /line-break/ in
> [lex.whitespaces]). That seems intentional, but isn't
> discussed in the paper, so I'm not quite sure (the paper
> does discuss LF+CR but stops short of proposing support
> for it).
> So, if anyone strongly feels that a lone CR in a UTF-8
> source file should be considered a new-line in portable
> source files, please respond. Please note that
> implementors can support whatever new-line character
> sequences desired under the "For any other kind of input
> file supported by the implementation ..." part of
> [lex.phases]p1 <http://eel.is/c++draft/lex.phases#1.1>.
> Note that the choices made to resolve this issue might
> require implementations to make changes (e.g., to
> recognize new-line sequences that they don't today).
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2022/11/13443.php

Received on 2022-11-08 05:43:55