ISOCPP sg16 List: Re: [isocpp-sg16] Agenda for the 2025-05-14 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 May 2025 14:29:34 -0400

On 5/14/25 1:46 PM, Alisdair Meredith wrote:
> I was looking at the concern you raise as I was re-reading my paper to
> present today :)
>
> The intent of my paper was to be as close to an editorial fix as
> possible, so I wanted
> to minimize the level of wording changes. If I were to action your
> feedback, I would
> be concerned that we do not normatively apply the UTF-8 handling of
> carriage returns
> to the "source text" produced by the implementation-defined mappings.
>
> My ideal form would have source text be the output of translation
> phase 1, guaranteed
> to be UTF-8 encoded, and with the carriage return hack for line endings
> applied, so that
> new-line is synonymous with the new-line glyph.

I agree with one minor correction. The output of TP 1 is neither an
encoding scheme like UTF-16BE or UTF-16LE nor an encoding form like
UTF-16 or UTF-8. It is more a sequence of abstract characters which we
term a sequence of translation character set elements (because we also
accommodate Unicode scalar values that aren't mapped to abstract
characters).
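
To illustrate the distinction, here is a rough sketch (mine, not wording
from P3556R0 or the standard) of the carriage return handling expressed
over a sequence of elements rather than over code units. It assumes that
char32_t can stand in for a translation character set element; the name
normalize_line_endings is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: each char32_t stands in for one translation
// character set element. Every CR LF pair, and every CR not followed
// by LF, is replaced by a single new-line (U+000A).
std::u32string normalize_line_endings(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            // Skip the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == U'\n') {
                ++i;
            }
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    // Postcondition: no carriage return survives normalization.
    assert(out.find(U'\r') == std::u32string::npos);
    return out;
}
```

Because the function works on elements, it has no notion of UTF-8 or
UTF-16; whatever encoding the source file used has already been decoded
by the time this step runs.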

Which leads me to a minor correction for the paper. The last paragraph
of section 4.1 states:

    In this scenario, the text of the program is clearly stored in a
    source file that is scrawled in McNellis’s handwriting onto a piece
    of paper and, using the maker’s machine and with the correct
    implementation-defined mapping from paper source file to the
    translation character set, is input as *a stream of UTF-8
    codepoints* for the translator to process, completing phase 1 of
    translation. It is a valid and well-formed program

"a stream of UTF-8 codepoints" is not meaningful. One could say "a
stream of UTF-8 *encoded* code points" which would indicate a stream of
UTF-8 code units. The input to the translator is then the decoded
sequence of Unicode scalar values (converted to the translation
character set elements).
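
A minimal sketch of that decoding step may make the terminology
concrete. It is an illustration under my own assumptions, not text from
the paper: decode_utf8 is a hypothetical name, char is assumed to hold
one UTF-8 code unit, and char32_t is assumed to hold one Unicode scalar
value.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 code units into Unicode scalar
// values. Returns std::nullopt if the input is not well-formed UTF-8.
std::optional<std::u32string> decode_utf8(std::string_view units) {
    std::u32string scalars;
    std::size_t i = 0;
    while (i < units.size()) {
        const unsigned char lead = static_cast<unsigned char>(units[i]);
        std::size_t len;  // number of code units in this sequence
        char32_t value;   // payload bits accumulated so far
        char32_t min;     // smallest value that needs this length
        if (lead < 0x80) {
            len = 1; value = lead; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; value = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; value = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; value = lead & 0x07; min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation or invalid lead
        }
        if (units.size() - i < len) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(units[i + k]);
            if ((c & 0xC0) != 0x80) {
                return std::nullopt;  // not a continuation code unit
            }
            value = (value << 6) | (c & 0x3F);
        }
        if (value < min || value > 0x10FFFF) {
            return std::nullopt;  // overlong form or out of range
        }
        if (value >= 0xD800 && value <= 0xDFFF) {
            return std::nullopt;  // surrogates are not scalar values
        }
        scalars.push_back(value);
        i += len;
    }
    assert(i == units.size());  // every code unit was consumed
    return scalars;
}
```

For example, the two code units 0xC3 0xA9 decode to the single scalar
value U+00E9. The input is counted in code units and the output in
scalar values, which is why "UTF-8 codepoints" conflates the two.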

Tom.

>
> That would be a little more work wording phase 1, and I really do not
> want to make
> changes that seem bigger than might land as an NB editorial comment
> for C++26.
>
> That said, if the group agrees, I will work on providing that wording
> as an option so
> that it is available for consideration in subsequent reviews.
>
> AlisdairM
>
>> On May 14, 2025, at 1:37 PM, Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>> I said I'd give feedback to Alisdair on P3556R0 before the meeting.
>>
>> So briefly:
>> - Using "source file" is fine, ship it.
>> - For "source text", if we want to distinguish phase 1 and
>> post-phase-1 by using that term (I don't love it, but it seems
>> adequate), I think we are missing a definition for it.
>>
>> Maybe you could improve by adding a paragraph at the end of phase 1:
>>
>> > This sequence of translation character set elements is termed
>> > the _source text_.
>>
>> (There are probably less awkward ways to do that, but we mention
>> "sequence of translation character set elements" twice in the last
>> two paragraphs of phase 1)
>>
>> This paper, for me, does not resolve the confusion of the use of the
>> term "header file" (https://github.com/cplusplus/CWG/issues/665) -
>> but I don't think we necessarily want to do that in this paper.
>>
>>
>> Nit: top of page 9, "source text of the source file" seems redundant
>>
>> --
>> P3657R0
>>
>> I wish we had a grammar for comments. Other than that, ship it.
>>
>>
>> Thanks for working on this, Alisdair!
>>
>> On Wed, May 14, 2025 at 5:32 AM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> SG16 will hold a meeting *today/tomorrow*, Wednesday, May 14th,
>> at 19:30 UTC (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20250514T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>
>> If you need a .ics file to import into your calendar, you can
>> download it here
>> <https://documents.isocpp.org/remote.php/dav/public-calendars/R7imgS2LJD9xfeWN/94A3D3A0-70B9-4847-935F-9453DB2BB216.ics?export>.
>>
>> The agenda follows.
>>
>> * P3658R0: Adjust /identifier/ following new Unicode
>> recommendations <https://wg21.link/p3658r0>.
>> * P3556R0: Input files are source files
>> <https://wg21.link/p3556r0>.
>> * P3657R0: A Grammar for Whitespace Characters
>> <https://wg21.link/p3657r0>.
>>
>> *P3658R0*, by our good friend Robin Leroy, seeks to adjust the
>> character allowances for identifiers to include a more consistent
>> set of mathematical symbols. This recommendation comes from the
>> UTC in the wake of the adoption of P1949R7 (C++ Identifier Syntax
>> using Unicode Standard Annex 31) <https://wg21.link/p1949r7> for
>> C++23, a paper I'm sure you all remember well. Deployment of
>> P1949 was found to break some existing code that used identifiers
>> containing mathematical symbols that were made invalid by the
>> adoption of P1949R7, but that seemed quite reasonable considering
>> similar identifiers that were not made invalid. The UTC
>> investigated and produced a recommendation for general purpose
>> programming languages as published in UTS #55 (Unicode Source
>> Code Handling) <https://www.unicode.org/reports/tr55/>. The
>> Unicode stability policy prohibited directly changing the
>> XID_Start and XID_Continue properties, so a Mathematical
>> Compatibility Notation Profile
>> <https://www.unicode.org/reports/tr31/#Mathematical_Compatibility_Notation_Profile>
>> was defined with corresponding ID_Compat_Math_Start
>> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AID_Compat_Math_Start%3A%5D&g=&i=>
>> and ID_Compat_Math_Continue
>> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AID_Compat_Math_Continue%3A%5D&g=&i=>
>> properties to identify the member characters. The proposed
>> changes are rather straightforward; modify the
>> /identifier-start/
>> <https://eel.is/c++draft/lex.name#nt:identifier-start> and
>> /identifier-continue/
>> <https://eel.is/c++draft/lex.name#nt:identifier-continue> grammar
>> productions to include characters identified by the new properties.
>>
>> *P3556R0* and *P3657R0* come to us courtesy of Alisdair Meredith.
>> These papers are intended to clarify core language wording
>> related to input/source file terminology and the specification of
>> whitespace characters. Both papers are near editorial in nature,
>> but sufficiently complicated to warrant CWG review; SG16 was
>> requested to review since these touch topics near and dear to us.
>> P3556R0 does not include any intended impact to existing
>> implementations. P3657R0 includes two normative changes; it
>> addresses CWG 1655 (Line endings in raw string literals)
>> <https://wg21.link/cwg1655> and it removes a case of IFNDR from
>> [lex.comment]p1
>> <https://eel.is/c++draft/lex.comment#1.sentence-4> as previously
>> proposed by Corentin in P2348R3 (Whitespaces Wording Revamp)
>> <https://wg21.link/p2348r3>.
>>
>>
>>
>> Tom.
>>
>>
>>
>

Received on 2025-05-14 18:29:40