Re: Agenda for the 2022-05-25 SG16 telecon

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 22 May 2022 17:40:40 -0400
On Sun, May 22, 2022 at 5:11 PM Hubert Tong
<hubert.reinterpretcast_at_[hidden]> wrote:
>
> On Sun, May 22, 2022 at 2:24 AM Jens Maurer via SG16
> <sg16_at_[hidden]> wrote:
> >
> > On 20/05/2022 18.34, Tom Honermann via SG16 wrote:
> > > * L2/22-072R: Proposal for amendments to UAX#9 and UAX#31 <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf>
> > > o Review for familiarity and relevance to P1949: C++ Identifier Syntax using Unicode Standard Annex 31 <https://wg21.link/p1949>.
> > >
> > > L2/22-072R <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf> was produced by the Unicode Source Code Ad-Hoc Group and adopted in April into the proposed updates for Unicode 15 per the Draft Minutes of UTC Meeting 171 <https://www.unicode.org/L2/L2022/22061.htm#171-C25>. Thanks are owed to Robin Leroy (CC'd) for bringing this paper to our attention. The paper discusses handling of source code that contains characters that have right-to-left (RTL) directionality. The changes made to UAX#9 (Unicode Bidirectional Algorithm) <https://www.unicode.org/reports/tr9/proposed.html#HL4Example2> (in yellow highlight) are concerned with presentation of source code and are therefore more of a concern for SG15 (Tooling), where they would be applicable to compilers (e.g., in diagnostics), editors, code review tools, etc. The changes to UAX#31 (Unicode Identifier and Pattern Syntax) <https://www.unicode.org/reports/tr31/proposed.html#Pattern_Syntax> (in yellow highlight) clarify that
> > > rule UAX31-R3 <https://unicode.org/reports/tr31/#R3> is applicable to programming languages and present an example illustrating how use of LEFT-TO-RIGHT MARK (LRM) and RIGHT-TO-LEFT MARK (RLM) as whitespace characters (but not in isolation) may be desirable so that source code rendered as plain text does not present the source code in a confusing or surprising manner. The adopted changes suggest (at least) the following items for us to consider:
> >
> > The most interesting shift in viewpoint that occurs with this update
> > is that UAX31-R3 is indeed intended to apply to the lexing of
> > programming language source code. The use of "pattern language"
> > without a mention of source code in the introductory paragraph was
> > not helpful. Mentally replacing "pattern language" with "lexing
> > rules" while reading UAX31-R3 helped quite a lot.
> >
> > > 1. [uaxid.pattern]p2 <http://eel.is/c++draft/uaxid.pattern#2>, as added by P1949 <https://wg21.link/p1949>, states that UAX31-R3 <https://unicode.org/reports/tr31/#R3> is not applicable to C++ but in light of the updates above, that is not correct. The entry should be updated to state our conformance and possibly declare a profile for our use of Pattern_White_Space <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_White_Space%3A%5D&g=&i=> and Pattern_Syntax <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_Syntax%3A%5D&g=&i=> characters.
> >
> > It seems we need to check that the C++ understanding of whitespace is
> > a subset of Pattern_White_Space.
>
> It is a subset (exactly the subset within U+0000 to U+007F, inclusive;
> except U+000D CARRIAGE RETURN is not a member of the basic character
> set(!)).

Once we have P2348, adding U+000D CARRIAGE RETURN to the basic
character set should actually be editorial.

Nothing in the wording as it is now precludes the grammar from having
characters outside the basic character set be significant to parsing.
\u000d is already ill-formed outside of string and character literals,
and it is already in the basic literal character set.


>
> Also, whether the requirement means that a language has exactly
> Pattern_White_Space as its definition for whitespace is in question.
>
> > We also need to check that the C++
> > understanding of characters outside those in identifiers and whitespace
> > is a subset of Pattern_Syntax, and document the results in Annex E.
>
> In C++, characters outside of identifiers can include those that may
> appear in identifiers (pp-numbers are interesting). The set of
> characters outside of identifiers, whitespace, "literals", and
> comments is a subset of Pattern_Syntax.
>
> >
> > The existing note
> >
> > "When meeting this requirement, all characters except those that have the
> > Pattern_White_Space or Pattern_Syntax properties are available for use as
> > identifiers or literals."
> >
> > is likely not true for C++; there are most likely characters that cannot
> > be used in identifiers (because they're not in XID_Start or XID_Continue),
> > yet do not match Pattern_White_Space or Pattern_Syntax, either.
> >
> > > 2. Per the example added to UAX31-R3 <https://unicode.org/reports/tr31/#R3>, consider allowing LRM and RLM to appear in whitespace (this would be an additional change to consider on top of P2348: Whitespaces Wording Revamp <https://wg21.link/p2348> after C++23 pending updated Unicode guidance).
> >
> > I'm a bit worried about the phrasing "implicit directional marks".
> > Are those supposed to appear out of thin air at some point during
> > translation? Or is that just some descriptive term without deeper
> > meaning?
> >
> > Is there a readable list of Pattern_White_Space characters somewhere?
>
> 0009..000D ; Pattern_White_Space # Cc [5] <control-0009>..<control-000D>
> 0020 ; Pattern_White_Space # Zs SPACE
> 0085 ; Pattern_White_Space # Cc <control-0085>
> 200E..200F ; Pattern_White_Space # Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
> 2028 ; Pattern_White_Space # Zl LINE SEPARATOR
> 2029 ; Pattern_White_Space # Zp PARAGRAPH SEPARATOR
>
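For illustration only, the list above can be transcribed into a small C++ predicate, together with a compile-time check of the earlier claim that, within U+0000 to U+007F, Pattern_White_Space is the C++ basic-character-set whitespace plus U+000D CARRIAGE RETURN. This is a minimal sketch: the function names are hypothetical, and the set of C++ whitespace characters used here (horizontal tab, new-line, vertical tab, form feed, space) is an assumption written out by hand, not taken from any library.

```cpp
// Pattern_White_Space membership, transcribed from the list quoted above.
// Hypothetical helper; not part of any standard or library API.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // <control-0009>..<control-000D>
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // <control-0085>
        || c == 0x200E || c == 0x200F    // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
        || c == 0x2028                   // LINE SEPARATOR
        || c == 0x2029;                  // PARAGRAPH SEPARATOR
}

// Assumption for this sketch: the whitespace characters of the C++ basic
// character set are HT, LF (new-line), VT, FF, and SPACE.
constexpr bool is_cpp_basic_whitespace(char32_t c) {
    return c == 0x0009 || c == 0x000A || c == 0x000B
        || c == 0x000C || c == 0x0020;
}

// Checks the claim for every code point in U+0000..U+007F: a code point is
// in Pattern_White_Space exactly when it is C++ whitespace or U+000D.
constexpr bool ascii_claim_holds() {
    for (char32_t c = 0; c <= 0x7F; ++c) {
        const bool expected = is_cpp_basic_whitespace(c) || c == 0x000D;
        if (is_pattern_white_space(c) != expected) {
            return false;
        }
    }
    return true;
}

static_assert(ascii_claim_holds(),
              "ASCII Pattern_White_Space == C++ whitespace + U+000D");
```

The check needs C++14 or later because of the loop in a constexpr function. The predicate treats LRM and RLM as whitespace only because they are in the property list; whether C++ should accept them in whitespace is the open question in item 2.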
> >
> > > 3. Consider proposing recommended display behaviors to SG15; presumably in line with HL4 from UAX#9 section 4.3, "Higher-Level Protocols" <https://unicode.org/reports/tr9/#Higher-Level_Protocols>. My understanding is that Microsoft Visual Studio implements this behavior. Opportunities for diagnostic improvements can be seen at https://godbolt.org/z/MM1xE5dM1 (note that the caret position is not aligned with the identifier it intends to highlight; this is because the code display and caret location are not in sync with regard to how RTL characters affect presentation).
> >
> > I'm not sure SG15 has the right people to address this; they
> > seem to be mostly focused on build systems and modules these
> > days.
> >
> > Jens

Received on 2022-05-22 21:41:09