I was looking at the concern you raise as I was re-reading my paper to present today :)
The intent of my paper was to be as close to an editorial fix as possible, so I wanted to minimize the level of wording changes. If I were to action your feedback, I would be concerned that we do not normatively apply the UTF-8 handling of carriage returns to the "source text" produced by the implementation-defined mappings.
My ideal form would have source text be the output of translation phase 1, guaranteed to be UTF-8 encoded, and with the carriage return hack for line endings applied, so that new-line is synonymous with the new-line glyph.
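For readers less familiar with the carriage return handling mentioned above, here is a minimal sketch of it, assuming the input has already been decoded into a sequence of code points. The function name `normalize_new_lines` is hypothetical and is not taken from the paper or from any implementation; it only illustrates replacing each CR LF pair, and each lone CR, with a single new-line.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (illustration only): replaces each CR LF pair and each
// CR not followed by LF with a single new-line (LF) character.
std::u32string normalize_new_lines(std::u32string_view in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            // Skip the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == U'\n') {
                ++i;
            }
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

With this sketch, Windows-style, classic Mac-style, and Unix-style line endings all end up as the same new-line element before later phases see the text.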
I agree with one minor correction. The output of TP 1 is neither an encoding scheme like UTF-16BE or UTF-16LE nor an encoding form like UTF-16 or UTF-8. It is more a sequence of abstract characters which we term a sequence of translation character set elements (because we also accommodate Unicode scalar values that aren't mapped to abstract characters).
Which leads me to a minor correction for the paper. The last paragraph of section 4.1 states:
> In this scenario, the text of the program is clearly stored in a source file that is scrawled in McNellis’s handwriting onto a piece of paper and, using the maker’s machine and with the correct implementation-defined mapping from paper source file to the translation character set, is input as a stream of UTF-8 codepoints for the translator to process, completing phase 1 of translation. It is a valid and well-formed program
"a stream of UTF-8 codepoints" is not meaningful. One could say
"a stream of UTF-8 encoded code points" which would
indicate a stream of UTF-8 code units. The input to the translator
is then the decoded sequence of Unicode scalar values (converted
to the translation character set elements).
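To make the code unit versus scalar value distinction concrete, here is a minimal sketch of the decoding step, assuming the source file arrives as UTF-8 code units. The name `decode_utf8` is hypothetical and the error handling (returning an empty optional for ill-formed input) is an assumption made for illustration; a real implementation would issue a diagnostic instead.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (illustration only): decodes a sequence of UTF-8 code
// units into Unicode scalar values. Returns std::nullopt for ill-formed input
// (bad lead or continuation byte, truncation, overlong form, surrogate, or a
// value above U+10FFFF).
std::optional<std::vector<char32_t>> decode_utf8(std::string_view units) {
    std::vector<char32_t> scalars;
    std::size_t i = 0;
    while (i < units.size()) {
        const auto lead = static_cast<unsigned char>(units[i]);
        std::size_t len = 0;
        char32_t value = 0;
        char32_t min = 0;  // smallest value that needs this sequence length
        if (lead < 0x80) {
            len = 1; value = lead; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; value = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; value = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; value = lead & 0x07; min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation byte or invalid lead
        }
        if (units.size() - i < len) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto cont = static_cast<unsigned char>(units[i + k]);
            if ((cont & 0xC0) != 0x80) {
                return std::nullopt;  // not a continuation byte
            }
            value = (value << 6) | (cont & 0x3F);
        }
        if (value < min) return std::nullopt;                        // overlong
        if (value >= 0xD800 && value <= 0xDFFF) return std::nullopt; // surrogate
        if (value > 0x10FFFF) return std::nullopt;                   // out of range
        scalars.push_back(value);
        i += len;
    }
    return scalars;
}
```

For example, the two code units 0xC3 0xA9 decode to the single scalar value U+00E9; it is the decoded scalar values, not the code units, that correspond to translation character set elements.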
Tom.
That would be a little more work wording phase 1, and I really do not want to make changes that seem bigger than might land as an NB editorial comment for C++26.
That said, if the group agrees, I will work on providing that wording as an option so that it is available for consideration in subsequent reviews.
AlisdairM
On May 14, 2025, at 1:37 PM, Corentin Jabot <corentinjabot@gmail.com> wrote:
I said I'd give feedback to Alisdair on P3556R0 before the meeting.
So briefly:
- Using "source file" is fine, ship it.
- For "source text", if we want to distinguish phase 1 and post-phase-1 by using that term (I don't love it, but it seems adequate), I think we are missing a definition for it.
Maybe you could improve by adding a paragraph at the end of phase 1:
> This sequence of translation character set elements is termed the _source text_.
(There are probably less awkward ways to do that, but we mention "sequence of translation character set elements" twice in the last two paragraphs of phase 1)
This paper, for me, does not resolve the confusion of the use of the term "header file" (https://github.com/cplusplus/CWG/issues/665) - but I don't think we necessarily want to do that in this paper.
Nit: top of page 9, "source text of the source file" seems redundant
P3657R0
I wish we had a grammar for comments. Other than that, ship it.
Thanks for working on this, Alisdair!
On Wed, May 14, 2025 at 5:32 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
SG16 will hold a meeting today/tomorrow, Wednesday, May 14th, at 19:30 UTC (timezone conversion).
If you need a .ics file to import into your calendar, you can download it here.
The agenda follows.
- P3658R0: Adjust identifier following new Unicode recommendations.
- P3556R0: Input files are source files.
- P3657R0: A Grammar for Whitespace Characters.
P3658R0, by our good friend Robin Leroy, seeks to adjust the character allowances for identifiers to include a more consistent set of mathematical symbols. This recommendation comes from the UTC in the wake of the adoption of P1949R7 (C++ Identifier Syntax using Unicode Standard Annex 31) for C++23, a paper I'm sure you all remember well. Deployment of P1949 was found to break some existing code that used identifiers containing mathematical symbols that were made invalid by the adoption of P1949R7, but that seemed quite reasonable considering similar identifiers that were not made invalid. The UTC investigated and produced a recommendation for general purpose programming languages as published in UTS #55 (Unicode Source Code Handling). The Unicode stability policy prohibited directly changing the XID_Start and XID_Continue properties, so a Mathematical Compatibility Notation Profile was defined with corresponding ID_Compat_Math_Start and ID_Compat_Math_Continue properties to identify the member characters. The proposed changes are rather straightforward: modify the identifier-start and identifier-continue grammar productions to include characters identified by the new properties.
P3556R0 and P3657R0 come to us courtesy of Alisdair Meredith. These papers are intended to clarify core language wording related to input/source file terminology and the specification of whitespace characters. Both papers are near editorial in nature, but sufficiently complicated to warrant CWG review; SG16 was requested to review since these touch topics near and dear to us. P3556R0 does not include any intended impact to existing implementations. P3657R0 includes two normative changes; it addresses CWG 1655 (Line endings in raw string literals) and it removes a case of IFNDR from [lex.comment]p1 as previously proposed by Corentin in P2348R3 (Whitespaces Wording Revamp).
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Link to this post: http://lists.isocpp.org/sg16/2025/05/4571.php