Re: US 3-030: New-line character sequences in UTF-8 source files

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 8 Nov 2022 02:08:31 +0100
The design pillars of P2348:

   - Standardize existing practice
   - Standardize existing standards (i.e., Unicode, except when it conflicts
   with the first point; cf. the discussion on vertical tab in the paper)
   - Don't deviate from the status quo needlessly
   - Don't be inventive
   - Don't burden implementations gratuitously (hence why code points
   outside of basic latin 1 are not considered)
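
To make the difference between the two directions concrete, here is a rough
sketch of a line splitter. It is not taken from P2348 or from any
implementation; the name split_lines and the flag lone_cr_breaks are made up
for illustration. With the flag set, CR, LF and CR+LF each end a line (my
reading of the P2348 grammar); with it cleared, only LF and CR+LF do (the
direction discussed in CWG), and a lone CR stays in the line content.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper for illustration only (not from P2348 or a compiler).
// Splits `text` into lines. LF and CR+LF always count as one line break.
// A CR not followed by LF counts as a break only if `lone_cr_breaks` is true.
std::vector<std::string_view> split_lines(std::string_view text,
                                          bool lone_cr_breaks) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;  // index where the current line begins
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = 0;  // length of the break sequence found at i, if any
        if (text[i] == '\n') {
            len = 1;
        } else if (text[i] == '\r') {
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                len = 2;  // CR+LF is a single break
            } else if (lone_cr_breaks) {
                len = 1;  // lone CR
            }
        }
        if (len == 0) {
            ++i;  // ordinary character (or a lone CR kept as content)
            continue;
        }
        lines.push_back(text.substr(start, i - start));
        i += len;
        start = i;
    }
    // Trailing text with no final line break still forms a line.
    if (start < text.size()) {
        lines.push_back(text.substr(start));
    }
    return lines;
}
```

Note that LF+CR gets no special handling in either mode: it is an LF break
followed by a CR, so it yields two breaks when lone CR is recognized and one
break plus a stray CR otherwise.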


On Tue, Nov 8, 2022 at 2:00 AM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> P2348 very intentionally supports CR, LF, and CR+LF as new-lines.
> These are the new-line sequences used on common platforms. I don't know how
> common it is to have a lone CR in UTF-8 files (probably uncommon), but we
> should not tie which code points are line-breaking to the encoding or
> UTF-8-ness of the file, even in the presence of an implementation-defined
> mapping.
>
> LF+CR is an oddity: the platforms that used it are no longer relevant to
> C++ and/or have not been supported since long before C++ was a thing.
> Implementations do not consider that sequence to be a single line break,
> and we probably don't want to force that.
>
> P2348 intends to reflect standard practice (and standards).
> It is true that CR is not discussed in the paper. But we should note that
> ASCII, Unicode, existing implementations, and other tools all treat CR
> as a line break.
> Why be inventive?
>
> On Tue, Nov 8, 2022 at 1:37 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> CWG reviewed US-3-030 <https://github.com/cplusplus/nbballot/issues/475>
>> today. Minutes here
>> <https://wiki.edg.com/bin/view/Wg21kona2022/CoreWorkingGroup>.
>>
>> I didn't schedule this issue for discussion in SG16 because I didn't
>> think there was anything interesting for SG16 to weigh in on. However, the
>> CWG discussion turned out to be more interesting than I anticipated. I
>> expect that the resolution direction described below has SG16 consensus,
>> but am sending this message to provide an opportunity to object if anyone
>> has concerns.
>>
>> During the discussion, I noted that one of the primary goals of P2295
>> (Support for UTF-8 as a portable source file encoding)
>> <https://wg21.link/p2295> was to ensure a portable source file. For a
>> source file to be portable, new-line character sequences must be portably
>> recognized as such. The change proposed with the NB comment left the set of
>> character sequences that constitute a new-line unspecified. I expressed a
>> desire to specify which character sequences constitute a new-line. We then
>> discussed which sequences should be recognized and settled on LF and CR+LF.
>> Support for CR on its own was discussed, but it was felt more evidence and
>> motivation should be provided for that case.
>>
>> The direction to specify LF and CR+LF as new-line character sequences was
>> believed to be consistent with P2348 (Whitespaces Wording Revamp)
>> <https://wg21.link/p2348> which both SG16 and EWG have previously
>> approved (see polling records in the corresponding GitHub issue
>> <https://github.com/cplusplus/papers/issues/1027>). However, upon
>> reviewing the wording, it looks to me that P2348 does permit CR by itself
>> to constitute a new-line (see the proposed grammar additions for
>> *line-break* in [lex.whitespaces]). That seems intentional, but isn't
>> discussed in the paper, so I'm not quite sure (the paper does discuss LF+CR
>> but stops short of proposing support for it).
>>
>> So, if anyone strongly feels that a lone CR in a UTF-8 source file should
>> be considered a new-line in portable source files, please respond. Please
>> note that implementors can support whatever new-line character sequences
>> desired under the "For any other kind of input file supported by the
>> implementation ..." part of [lex.phases]p1
>> <http://eel.is/c++draft/lex.phases#1.1>.
>>
>> Note that the choices made to resolve this issue might require
>> implementations to make changes (e.g., to recognize new-line sequences that
>> they don't today).
>>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2022-11-08 01:08:45