Re: US 3-030: New-line character sequences in UTF-8 source files

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 8 Nov 2022 00:41:31 -0500

Thanks, Corentin.

I agree that, if ~all existing implementations already treat a lone CR
as a new-line, then we might as well standardize it. However, if some
don't, then we'll be adding a (probably small) implementation burden for
something that I suspect is rare. LF and CR+LF are common occurrences.
Do you have data that shows that lone CR is 1) recognized by ~all
existing implementations, and 2) is used sufficiently often that it is
worth standardizing? Do we want to encourage use of lone CR as a
portable new-line? As mentioned, implementations can still support it
regardless. Unicode also recognizes U+0085 (NEXT LINE), U+2028 (LINE
SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR) as line-break characters.

I think it would be worth adding such analysis to a future revision of
P2348.

In the interest of time, is anyone opposed to the CWG direction of
requiring both LF and CR+LF in portable UTF-8 source files for C++23
with support for other new-line sequences left to a future standard?
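
To make the difference between the two candidate rules concrete, here is a minimal sketch of a byte-level scanner that counts new-line sequences in UTF-8 text. The names `NewLinePolicy` and `count_new_lines` are hypothetical, invented for this illustration; they do not come from P2348, P2295, or any implementation. Scanning bytes is enough here because 0x0D and 0x0A never occur inside a multi-byte UTF-8 sequence.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical policy switch (illustration only, not from any paper).
enum class NewLinePolicy {
    LfAndCrLf,        // the CWG direction: LF and CR+LF only
    LfCrLfAndLoneCr   // additionally treat a lone CR as a new-line
};

// Counts the new-line sequences in `src` under the given policy.
std::size_t count_new_lines(std::string_view src, NewLinePolicy policy) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\n') {
            ++count;                          // LF
        } else if (src[i] == '\r') {
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                ++count;                      // CR+LF is one new-line
                ++i;                          // skip the LF just consumed
            } else if (policy == NewLinePolicy::LfCrLfAndLoneCr) {
                ++count;                      // lone CR
            }
        }
    }
    return count;
}
```

Under this sketch, the input "a\nb\r\nc\rd" contains two new-lines with `LfAndCrLf` and three with `LfCrLfAndLoneCr`. An LF+CR pair is never a single new-line: it counts as one under the first policy and as two under the second.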

Tom.

On 11/7/22 8:08 PM, Corentin Jabot wrote:
> The design pillars of P2348:
>
> * Standardize existing practice
> * Standardize existing standards (i.e., Unicode, except when it
> conflicts with 1; cf. the discussion on vertical tab in the paper)
> * Don't deviate from the status quo needlessly
> * Don't be inventive
> * Don't burden implementations gratuitously (hence why code points
> outside of basic latin 1 are not considered)
>
>
> On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot
> <corentinjabot_at_[hidden]> wrote:
>
> P2348 very intentionally supports CR, LF, and CR+LF as new lines.
> These are the set of new lines on common platforms. I don't know
> how common it is to have lone CR + UTF-8 (probably uncommon), but we
> should not tie which code points are breaking to encoding/UTF-8-ness,
> even in the presence of implementation-defined mapping.
>
> LF+CR is an oddity that is no longer relevant on platforms
> supported by C++ and/or have not been supported since long before
> C++ was a thing.
> Implementations do not consider that sequence to be a single line
> break, and we probably don't want to force that.
>
> P2348 intends to reflect standard practices (and standards).
> It is true that CR is not discussed. But we should note that both
> ASCII and Unicode, and existing implementations, other tools, etc,
> will treat CR as a line break.
> Why be inventive?
>
>
>
>
>
> On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> CWG reviewed US-3-030
> <https://github.com/cplusplus/nbballot/issues/475> today.
> Minutes here
> <https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.
>
> I didn't schedule this issue for discussion in SG16 because I
> didn't think there was anything interesting for SG16 to weigh
> in on. However, the CWG discussion turned out to be more
> interesting than I anticipated. I expect that the resolution
> direction described below has SG16 consensus, but am sending
> this message to provide an opportunity to object if anyone has
> concerns.
>
> During the discussion, I noted that one of the primary goals
> of P2295 (Support for UTF-8 as a portable source file
> encoding) <https://wg21.link/p2295> was to ensure a portable
> source file. For a source file to be portable, new-line
> character sequences must be portably recognized as such. The
> change proposed with the NB comment left the set of character
> sequences that constitute a new-line unspecified. I expressed
> a desire to specify which character sequences constitute a
> new-line. We then discussed which sequences should be
> recognized and settled on LF and CR+LF. Support for CR on its
> own was discussed, but it was felt more evidence and
> motivation should be provided for that case.
>
> The direction to specify LF and CR+LF as new-line character
> sequences was believed to be consistent with P2348
> (Whitespaces Wording Revamp) <https://wg21.link/p2348> which
> both SG16 and EWG have previously approved (see polling
> records in the corresponding GitHub issue
> <https://github.com/cplusplus/papers/issues/1027>). However,
> upon reviewing the wording, it looks to me that P2348 does
> permit CR by itself to constitute a new-line (see the proposed
> grammar additions for /line-break/ in [lex.whitespaces]). That
> seems intentional, but isn't discussed in the paper, so I'm
> not quite sure (the paper does discuss LF+CR but stops short
> of proposing support for it).
>
> So, if anyone strongly feels that a lone CR in a UTF-8 source
> file should be considered a new-line in portable source files,
> please respond. Please note that implementors can support
> whatever new-line character sequences desired under the "For
> any other kind of input file supported by the implementation
> ..." part of [lex.phases]p1
> <http://eel.is/c++draft/lex.phases#1.1>.
>
> Note that the choices made to resolve this issue might require
> implementations to make changes (e.g., to recognize new-line
> sequences that they don't today).
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-11-08 05:41:35