Re: [isocpp-core] US 3-030: New-line character sequences in UTF-8 source files

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 8 Nov 2022 21:54:47 -0500
On 11/8/22 9:09 PM, Richard Smith via SG16 wrote:
> On Tue, 8 Nov 2022 at 15:43, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 11/8/22 4:16 PM, Jens Maurer wrote:
> >
> > On 08/11/2022 21.51, Richard Smith via SG16 wrote:
> >> On Tue, 8 Nov 2022 at 11:09, Tom Honermann via Core
> <core_at_[hidden]> wrote:
> >>
> >> On 11/8/22 10:32 AM, William M. (Mike) Miller wrote:
> >>> On Tue, Nov 8, 2022 at 12:41 AM Tom Honermann via Core
> <core_at_[hidden]> wrote:
> >>>
> >>> Thanks, Corentin.
> >>>
> >>> I agree that, if ~all existing implementations
> already treat a lone CR as a new-line, then we might as well
> standardize it. However, if some don't, then we'll be adding a
> (probably small) implementation burden for something that I
> suspect is rare. LF and CR+LF are common occurrences. Do you have
> data that shows that lone CR is 1) recognized by ~all existing
> implementations, and 2) is used sufficiently often that it is
> worth standardizing? Do we want to encourage use of lone CR as a
> portable new-line? As mentioned, implementations can still support
> it regardless. Unicode also recognizes U+0085 (NEXT LINE), U+2028
> (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR) as line-break
> characters.
> >>>
> >>> I think it would be worth adding such analysis to a
> future revision of P2348.
> >>>
> >>> In the interest of time, is anyone opposed to the CWG
> direction of requiring both LF and CR+LF in portable UTF-8 source
> files for C++23 with support for other new-line sequences left to
> a future standard?
> >>>
> >>>
> >>> Actually, CWG changed direction in the late afternoon
> session and decided to accept CR as a line-termination character.
> I'm about to upload drafting implementing that direction for
> discussion today.
> >> Ah, thank you, I'm sorry I missed that discussion.
> >>
> >> That change resolves the inconsistency with P2348 given
> Corentin's explicit claim of the intent in that paper.
> >>
> >> I'm personally happy with this new direction so long as
> implementors have no concerns (and it seems we already have
> confirmation that EDG and Clang have no concerns).
> >>
> >> Given that we already had consensus for P2348 in SG16 and
> EWG, assuming no new objections are raised, ship it.
> >>
> >> Not an objection, mostly just clarifying intent: given a source
> file that contains "#define a \<LF><CR> b", is a conforming
> implementation required to treat the "a" macro as being empty and
> the "b" as being on a separate line (as GCC does), or is it still
> permitted to treat the "b" as being on the same line as the "a"
> because the <LF><CR> is treated as an escaped new-line sequence
> (as Clang does)?
> > No, not if the source file is a "UTF-8 file" per phase 1.
>
> Richard's question wasn't a yes/no question, but Jens' response
> appears to favor the gcc behavior in which <LF> and <CR> each
> contribute a new-line. I agree.
>
> Since Unicode does not recognize LF+CR as a single new-line (as it
> does for CR+LF), I think the gcc behavior is preferred for portable
> UTF-8 files.
>
>
> I assume you're referring to UTR#13 here? Yeah, seems reasonable to
> follow that in portable Unicode UTF-8 mode.

UAX #14 (Unicode Line Breaking Algorithm)
<https://unicode.org/reports/tr14/>, actually. UAX #13 (Unicode Newline
Guidelines) <https://unicode.org/reports/tr13/> was incorporated into
the core specification. I consulted the "Non-tailorable Line
Breaking Classes" section of Table 1
<https://unicode.org/reports/tr14/#Table1>.
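
To make the behavior under discussion concrete, here is a rough sketch of
the rule for portable UTF-8 files: CR+LF, lone CR, and lone LF each count
as one new-line, so LF followed by CR counts as two. The names split_lines
and demo are hypothetical; this is not taken from gcc, Clang, EDG, or the
proposed wording, and it ignores line splicing and everything else in
phases 1 and 2.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (an illustration only, not any compiler's code).
// Splits text into lines, treating CR+LF, lone CR, and lone LF each as
// exactly one new-line. LF followed by CR is therefore two new-lines.
std::vector<std::string> split_lines(std::string_view text) {
    std::vector<std::string> lines;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r') {
            // CR optionally followed by LF forms a single new-line.
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
            lines.push_back(current);
            current.clear();
        } else if (c == '\n') {
            lines.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    // Keep a final line that has no terminating new-line.
    if (!current.empty()) {
        lines.push_back(current);
    }
    return lines;
}

// Richard's example: "#define a \<LF><CR> b".
void demo() {
    const std::vector<std::string> lines = split_lines("#define a \\\n\r b");
    assert(lines.size() == 3);          // LF and CR each end a line
    assert(lines[0] == "#define a \\"); // backslash is followed by one new-line
    assert(lines[1].empty());           // the empty line between LF and CR
    assert(lines[2] == " b");           // "b" ends up on a separate line
}
```

Under this reading, the backslash splices the first line with the empty
line that follows it, which leaves the "a" macro empty and "b" on its own
line, matching the gcc behavior described earlier in the thread.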

Tom.

> > (Note that clang is internally inconsistent here; see the
> __LINE__ example on the core wiki,
> > which shows that LF CR is considered two lines in other contexts.)
> >
> >> I don't think LF CR is at all common these days -- I think it
> was only really used on the BBC Micro and on Acorn RISC PCs, but
> those both still exist, and Wikipedia says the Acorn C/C++
> compiler suite had a release earlier this year. Hopefully we're
> not going to break line continuations in all of their macros :)
> > Again, such files can be supported in the "non-UTF-8 mode" of
> phase 1.
>
> Agreed.
>
>
> Sure.
>
> Tom.
>
> >
> > Jens
> >
> >
> >> Tom.
> >>
> >>> I don't know about the ubiquity of that support, but the
> EDG front end has it as a build-time configuration option that
> customers can enable or not, as they choose. Here's the
> description of the flag (note that it cites gcc's processing as
> its basis):
> >>>
> >>> /*
> >>> Flag that is TRUE to indicate that carriage return or
> carriage return
> >>> followed by newline can be used as a line terminator
> in GNU-compatible
> >>> modes. This feature is provided to allow files with
> old MacOS line
> >>> terminators to be accepted. The implementation is
> compatible with the way
> >>> in which the GNU compiler handles such line
> terminators. It is disabled by
> >>> default because it is not required by most users.
> >>> */
> >>>
> >>>
> >>
> >>
>
>

Received on 2022-11-09 02:54:49