On Wed, May 14, 2025 at 7:46 PM Alisdair Meredith <alisdairm@me.com> wrote:
I was looking at the concern you raise as I was re-reading my paper to present today :)
The intent of my paper was to be as close to an editorial fix as possible, so I wanted to minimize the level of wording changes. If I were to action your feedback, I would be concerned that we do not normatively apply the UTF-8 handling of carriage returns to the "source text" produced by the implementation-defined mappings.
My ideal form would have source text be the output of translation phase 1, guaranteed to be UTF-8 encoded, and with the carriage return hack for line endings applied, so that new-line is synonymous with the new-line glyph.
After phase 1, what you call the source text is a "sequence of translation character set elements", which is isomorphic to "a sequence of Unicode code points/scalar values", which are not encoded (and for which an implementation may choose whatever representation it likes).
The two paths in phase 1 are:
- UTF-8 -> Unicode code points
- Implementation-defined input -> Unicode code points
Either way, at the start of that sentence, we have a sequence of Unicode code points, which we call translation character set elements. And everything from that point on applies to that sequence.
I'm suggesting that to avoid confusion, we should introduce a term to refer to that sequence, which is what "source text" is in your paper. Except you are introducing the "source text" term without defining it.
(Note that we cannot define "source file"/"input file" or whatever we want to call it. And we specifically don't want to. That's agreed upon already; anything from which we can extract things that represent C++ code qualifies, and we are happy with that term being nebulous.)
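For readers following along, the UTF-8 path of phase 1 can be sketched roughly as below. This is only an illustration under stated assumptions, not wording or any implementation's actual code: the names `decode_utf8`, `normalize_line_endings`, and `phase1_utf8_path` are hypothetical, `std::u32string` stands in for "a sequence of translation character set elements", and byte order mark handling and the implementation-defined input path are left out.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 bytes into Unicode scalar values.
// Returns std::nullopt for ill-formed input (truncated, overlong,
// surrogate, or out-of-range sequences).
std::optional<std::u32string> decode_utf8(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;    // scalar value being assembled
        std::size_t len = 0; // length of the encoded sequence in bytes
        char32_t min = 0;   // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte
        if (i + len > bytes.size()) return std::nullopt;  // truncated
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;      // overlong, out of range, or surrogate
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Replace CR LF pairs and lone CRs with a single new-line.
std::u32string normalize_line_endings(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            if (i + 1 < in.size() && in[i + 1] == U'\n') ++i;  // swallow the LF
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Sketch of the UTF-8 path only: bytes in, unencoded code points out.
std::optional<std::u32string> phase1_utf8_path(std::string_view bytes) {
    auto decoded = decode_utf8(bytes);
    if (!decoded) return std::nullopt;
    return normalize_line_endings(*decoded);
}
```

The point of the sketch is that the result has no encoding of its own; whatever term we pick for it names the sequence that comes out of `phase1_utf8_path`, not the bytes that went in.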
+1 to Corentin's above reply.
Tom.
That would be a little more work wording phase 1, and I really do not want to make changes that seem bigger than might land as an NB editorial comment for C++26.
That said, if the group agrees, I will work on providing that wording as an option so that it is available for consideration in subsequent reviews.
AlisdairM
On May 14, 2025, at 1:37 PM, Corentin Jabot <corentinjabot@gmail.com> wrote:
I said I'd give feedback to Alisdair on P3556R0 before the meeting.
So briefly:
- Using "source file" is fine, ship it.
- For "source text", if we want to distinguish phase 1 and post-phase-1 by using that term (I don't love it, but it seems adequate), I think we are missing a definition for it.
Maybe you could improve by adding a paragraph at the end of phase 1:
> This sequence of translation character set elements is termed the _source text_.
(There are probably less awkward ways to do that, but we mention "sequence of translation character set elements" twice in the last two paragraphs of phase 1)
This paper, for me, does not resolve the confusion of the use of the term "header file" (https://github.com/cplusplus/CWG/issues/665) - but I don't think we necessarily want to do that in this paper.
Nit: top of page 9, "source text of the source file" seems redundant
-- P3657R0
I wish we had a grammar for comments. Other than that, ship it.
Thanks for working on this, Alisdair!
On Wed, May 14, 2025 at 5:32 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
SG16 will hold a meeting today/tomorrow, Wednesday, May 14th, at 19:30 UTC (timezone conversion).
If you need a .ics file to import into your calendar, you can download it here.
The agenda follows.
- P3658R0: Adjust identifier following new Unicode recommendations.
- P3556R0: Input files are source files.
- P3657R0: A Grammar for Whitespace Characters.
P3658R0, by our good friend Robin Leroy, seeks to adjust the character allowances for identifiers to include a more consistent set of mathematical symbols. This recommendation comes from the UTC in the wake of the adoption of P1949R7 (C++ Identifier Syntax using Unicode Standard Annex 31) for C++23, a paper I'm sure you all remember well. Deployment of P1949R7 was found to break some existing code whose identifiers contained mathematical symbols that the paper made invalid, even though those identifiers seemed quite reasonable considering similar identifiers that were not made invalid. The UTC investigated and produced a recommendation for general-purpose programming languages as published in UTS #55 (Unicode Source Code Handling). The Unicode stability policy prohibited directly changing the XID_Start and XID_Continue properties, so a Mathematical Compatibility Notation Profile was defined with corresponding ID_Compat_Math_Start and ID_Compat_Math_Continue properties to identify the member characters. The proposed changes are rather straightforward: modify the identifier-start and identifier-continue grammar productions to include characters identified by the new properties.
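As a rough illustration of the shape of that change (not the paper's wording), the check below accepts a character as identifier-start or identifier-continue if either the existing XID property or the new compatibility property holds. Everything here is an assumption made for the sketch: `PropertyLookups` and `is_identifier` are hypothetical names, the four lookups are supplied by the caller rather than taken from real Unicode Character Database tables, and other identifier requirements such as normalization are ignored.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical bundle of Unicode property lookups. A real implementation
// would back these with generated tables; none are provided here.
struct PropertyLookups {
    std::function<bool(char32_t)> xid_start;
    std::function<bool(char32_t)> xid_continue;
    std::function<bool(char32_t)> id_compat_math_start;
    std::function<bool(char32_t)> id_compat_math_continue;
};

// Sketch: identifier-start / identifier-continue widened by the new properties.
bool is_identifier(std::u32string_view s, const PropertyLookups& p) {
    if (s.empty()) return false;
    // Underscore is allowed at the start in C++ even though it is not XID_Start.
    const auto is_start = [&](char32_t c) {
        return c == U'_' || p.xid_start(c) || p.id_compat_math_start(c);
    };
    const auto is_continue = [&](char32_t c) {
        return c == U'_' || p.xid_continue(c) || p.id_compat_math_continue(c);
    };
    if (!is_start(s.front())) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!is_continue(s[i])) return false;
    }
    return true;
}
```

Keeping the new properties as separate lookups mirrors the situation described above: XID_Start and XID_Continue stay untouched, and the math profile is layered on top.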
P3556R0 and P3657R0 come to us courtesy of Alisdair Meredith. These papers are intended to clarify core language wording related to input/source file terminology and the specification of whitespace characters. Both papers are near editorial in nature, but sufficiently complicated to warrant CWG review; SG16 was requested to review since these touch topics near and dear to us. P3556R0 does not include any intended impact to existing implementations. P3657R0 includes two normative changes; it addresses CWG 1655 (Line endings in raw string literals) and it removes a case of IFNDR from [lex.comment]p1 as previously proposed by Corentin in P2348R3 (Whitespaces Wording Revamp).
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Link to this post: http://lists.isocpp.org/sg16/2025/05/4571.php