Re: [isocpp-core] US 3-030: New-line character sequences in UTF-8 source files

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Mon, 7 Nov 2022 20:58:15 -0500
On Mon, Nov 7, 2022 at 8:08 PM Corentin Jabot via Core <
core_at_[hidden]> wrote:

> The design pillars of P2348:
> - Standardize existing practice
> - Standardize existing standards (i.e., Unicode, except when it conflicts
> with 1, cf. the discussion on vertical tab in the paper)
> - Don't deviate from the status quo needlessly
> - Don't be inventive
> - Don't burden implementations gratuitously (hence why codepoints
> outside of basic latin 1 are not considered)

In the CWG discussion, it was noted that existing practice around the
treatment of CR differs between platforms. My understanding is that lone CR
is treated as \r in raw string literals by MSVC.
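
To make the difference between the two candidate rules concrete, here is a rough sketch of a line-break counter. It is not taken from P2348, the NB comment, or any compiler; the function name `count_line_breaks` and the `lone_cr_is_newline` switch are hypothetical, chosen only for illustration. LF and CR+LF always count as one line break each, a lone CR counts only when the switch is on, and LF+CR is never merged into a single break.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (illustration only): counts the line breaks in `src`.
// LF and CR+LF are always recognized as a single line break. A lone CR is
// recognized only when `lone_cr_is_newline` is true. LF followed by CR is
// never treated as one combined break.
std::size_t count_line_breaks(std::string_view src, bool lone_cr_is_newline) {
    std::size_t breaks = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\n') {
            ++breaks;  // LF on its own
        } else if (src[i] == '\r') {
            if (i + 1 < src.size() && src[i + 1] == '\n') {
                ++i;       // consume the LF of a CR+LF pair
                ++breaks;  // CR+LF is one break
            } else if (lone_cr_is_newline) {
                ++breaks;  // lone CR, only under the more permissive rule
            }
        }
    }
    return breaks;
}
```

For the input "a\nb\r\nc\rd", the LF and CR+LF-only rule sees two line breaks, while the rule that also accepts a lone CR sees three, which is the kind of divergence a portable source file would need to avoid.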

> On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>> P2348 very intentionally supports CR, LF, CR+LF as new lines.
>> These are the set of new lines on common platforms. I don't know how
>> common it is to have lone CR + UTF-8 (probably uncommon), but we should not tie
>> which codepoints are breaking to encoding/UTF-8-ness, even in the presence of
>> an implementation-defined mapping.
>> LF+CR is an oddity that is no longer relevant on platforms supported by
>> C++ and/or has not been supported since long before C++ was a thing.
>> Implementations do not consider that sequence to be a single line break,
>> and we probably don't want to force that.
>> P2348 intends to reflect standard practices (and standards).
>> It is true that CR is not discussed. But we should note that both ASCII
>> and Unicode, and existing implementations, other tools, etc, will treat CR
>> as a line break.
>> Why be inventive?
>> On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>> CWG reviewed US-3-030 <https://github.com/cplusplus/nbballot/issues/475>
>>> today. Minutes here
>>> <https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.
>>> I didn't schedule this issue for discussion in SG16 because I didn't
>>> think there was anything interesting for SG16 to weigh in on. However, the
>>> CWG discussion turned out to be more interesting than I anticipated. I
>>> expect that the resolution direction described below has SG16 consensus,
>>> but am sending this message to provide an opportunity to object if anyone
>>> has concerns.
>>> During the discussion, I noted that one of the primary goals of P2295
>>> (Support for UTF-8 as a portable source file encoding)
>>> <https://wg21.link/p2295> was to ensure a portable source file. For a
>>> source file to be portable, new-line character sequences must be portably
>>> recognized as such. The change proposed with the NB comment left the set of
>>> character sequences that constitute a new-line unspecified. I expressed a
>>> desire to specify which character sequences constitute a new-line. We then
>>> discussed which sequences should be recognized and settled on LF and CR+LF.
>>> Support for CR on its own was discussed, but it was felt more evidence and
>>> motivation should be provided for that case.
>>> The direction to specify LF and CR+LF as new-line character sequences
>>> was believed to be consistent with P2348 (Whitespaces Wording Revamp)
>>> <https://wg21.link/p2348> which both SG16 and EWG have previously
>>> approved (see polling records in the corresponding GitHub issue
>>> <https://github.com/cplusplus/papers/issues/1027>). However, upon
>>> reviewing the wording, it looks to me that P2348 does permit CR by itself
>>> to constitute a new-line (see the proposed grammar additions for
>>> *line-break* in [lex.whitespaces]). That seems intentional, but isn't
>>> discussed in the paper, so I'm not quite sure (the paper does discuss LF+CR
>>> but stops short of proposing support for it).
>>> So, if anyone strongly feels that a lone CR in a UTF-8 source file
>>> should be considered a new-line in portable source files, please respond.
>>> Please note that implementors can support whatever new-line character
>>> sequences desired under the "For any other kind of input file supported by
>>> the implementation ..." part of [lex.phases]p1
>>> <http://eel.is/c++draft/lex.phases#1.1>.
>>> Note that the choices made to resolve this issue might require
>>> implementations to make changes (e.g., to recognize new-line sequences that
>>> they don't today).
>>> Tom.
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2022/11/13443.php

Received on 2022-11-08 01:58:45