Date: Sun, 22 May 2022 17:11:08 -0400
On Sun, May 22, 2022 at 2:24 AM Jens Maurer via SG16
<sg16_at_[hidden]> wrote:
>
> On 20/05/2022 18.34, Tom Honermann via SG16 wrote:
> > * L2/22-072R: Proposal for amendments to UAX#9 and UAX#31 <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf>
> > o Review for familiarity and relevance to P1949: C++ Identifier Syntax using Unicode Standard Annex 31 <https://wg21.link/p1949>.
> >
> > L2/22-072R <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf> was produced by the Unicode Source Code Ad-Hoc Group and adopted in April into the proposed updates for Unicode 15 per the Draft Minutes of UTC Meeting 171 <https://www.unicode.org/L2/L2022/22061.htm#171-C25>. Thanks are owed to Robin Leroy (CC'd) for bringing this paper to our attention. The paper discusses handling of source code that contains characters that have right-to-left (RTL) directionality. The changes made to UAX#9 (Unicode Bidirectional Algorithm) <https://www.unicode.org/reports/tr9/proposed.html#HL4Example2> (in yellow highlight) are concerned with presentation of source code and is therefore more of a concern for SG15 (Tooling) where it would be applicable to compilers (e.g., in diagnostics), editors, code review tools, etc... The changes to UAX#31 (Unicode Identifier and Pattern Syntax) <https://www.unicode.org/reports/tr31/proposed.html#Pattern_Syntax> (in yellow highlight) clarify that
> > rule UAX31-R3 <https://unicode.org/reports/tr31/#R3> is applicable to programming languages and present an example illustrating how use of LEFT-TO-RIGHT MARK (LRM) and RIGHT-TO-LEFT MARK (RLM) as whitespace characters (but not in isolation) may be desirable so that source code rendered as plain text does not present the source code in a confusing or surprising manner. The adopted changes suggest (at least) the following items for us to consider:
>
> The most interesting shift in viewpoint that occurs with this update
> is that UAX31-R3 is indeed intended to apply to the lexing of
> programming language source code. The use of "pattern language"
> without a mention of source code in the introductory paragraph was
> not helpful. Mentally replacing "pattern language" with "lexing
> rules" while reading UAX31-R3 helped quite a lot.
>
> > 1. [uaxid.pattern]p2 <http://eel.is/c++draft/uaxid.pattern#2>, as added by P1949 <https://wg21.link/p1949>, states that UAX31-R3 <https://unicode.org/reports/tr31/#R3> is not applicable to C++ but in light of the updates above, that is not correct. The entry should be updated to state our conformance and possibly declare a profile for our use of Pattern_White_Space <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_White_Space%3A%5D&g=&i=> and Pattern_Syntax <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_Syntax%3A%5D&g=&i=> characters.
>
> It seems we need to check that the C++ understanding of whitespace is
> a subset of Pattern_White_Space.
It is a subset (exactly the subset within U+0000 to U+007F, inclusive;
except U+000D CARRIAGE RETURN is not a member of the basic character
set(!)).
It is also an open question whether the requirement means that a
language must use exactly Pattern_White_Space as its definition of
whitespace.
> We also need to check that the C++
> understanding of characters outside those in identifiers and whitespace
> is a subset of Pattern_Syntax, and document the results in Annex E.
In C++, text outside of identifiers can still contain characters that
may appear in identifiers (pp-numbers are interesting). The set of
characters used outside of identifiers, whitespace, "literals", and
comments is a subset of Pattern_Syntax.
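To illustrate the pp-number remark: a minimal sketch, assuming a C++17
compiler. The macro names STR, XSTR, and N are made up for this
example. Because `1e+N` lexes as a single pp-number, the `N` inside it
is not a separate identifier token and is not macro-expanded.

```cpp
#include <string_view>

#define STR(x) #x
#define XSTR(x) STR(x)   // expands the argument before stringizing
#define N 5              // hypothetical macro, for illustration only

// "1e+N" is one pp-number token (digit, then "e" sign, then an
// identifier-continue character), so N is not replaced by 5.
static_assert(std::string_view(XSTR(1e+N)) == "1e+N",
              "1e+N is a single pp-number");

// With whitespace, N is its own identifier token and is expanded.
static_assert(std::string_view(XSTR(1 + N)) == "1 + 5",
              "N is expanded when it is a separate token");
```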
>
> The existing note
>
> "When meeting this requirement, all characters except those that have the
> Pattern_White_Space or Pattern_Syntax properties are available for use as
> identifiers or literals."
>
> is likely not true for C++; there are most likely characters that cannot
> be used in identifiers (because they're not in XID_Start or XID_Continue),
> yet do not match Pattern_White_Space or Pattern_Syntax, either.
>
> > 2. Per the example added to UAX31-R3 <https://unicode.org/reports/tr31/#R3>, consider allowing LRM and RLM to appear in whitespace (this would be an additional change to consider on top of P2348: Whitespaces Wording Revamp <https://wg21.link/p2348> after C++23 pending updated Unicode guidance).
>
> I'm a bit worried about the phrasing "implicit directional marks".
> Are those supposed to appear out of thin air at some point during
> translation? Or is that just some descriptive term without deeper
> meaning?
>
> Is there a readable list of Pattern_Whitespace characters somewhere?
0009..000D ; Pattern_White_Space # Cc [5] <control-0009>..<control-000D>
0020 ; Pattern_White_Space # Zs SPACE
0085 ; Pattern_White_Space # Cc <control-0085>
200E..200F ; Pattern_White_Space # Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028 ; Pattern_White_Space # Zl LINE SEPARATOR
2029 ; Pattern_White_Space # Zp PARAGRAPH SEPARATOR
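The subset claim made earlier can be checked mechanically against the
list above. This is a sketch, not wording: the function names are made
up, and is_cpp_whitespace assumes the C++ whitespace characters are
space, horizontal tab, vertical tab, form feed, and new-line.

```cpp
// Pattern_White_Space, transcribed from the list above.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

// Assumption: C++ whitespace is SPACE, HT, LF (new-line), VT, FF.
constexpr bool is_cpp_whitespace(char32_t c) {
    return c == 0x0020 || c == 0x0009 || c == 0x000A ||
           c == 0x000B || c == 0x000C;
}

// Checks the claim: within U+0000..U+007F, C++ whitespace is exactly
// Pattern_White_Space minus U+000D CARRIAGE RETURN.
constexpr bool ascii_claim_holds() {
    for (char32_t c = 0; c <= 0x7F; ++c) {
        bool expected = is_pattern_white_space(c) && c != 0x000D;
        if (is_cpp_whitespace(c) != expected) return false;
    }
    return true;
}

static_assert(ascii_claim_holds(),
              "C++ whitespace == ASCII Pattern_White_Space minus CR");
// LRM and RLM are Pattern_White_Space but not C++ whitespace today.
static_assert(is_pattern_white_space(0x200E) && !is_cpp_whitespace(0x200E),
              "LRM is only in Pattern_White_Space");
```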
>
> > 3. Consider proposing recommended display behaviors to SG15; presumably inline with HL4 from UAX#9 section 4.3, "Higher-Level Protocols" <https://unicode.org/reports/tr9/#Higher-Level_Protocols>. My understanding is that Microsoft Visual Studio implements this behavior. Opportunities for diagnostic improvements can be seen at https://godbolt.org/z/MM1xE5dM1 (note that the carat position is not aligned with the identifier it intends to highlight; this is because the code display and carat location are not in sync with regard to how RTL characters affect presentation).
>
> I'm not sure SG15 has the right people to address this; they
> seem to be mostly focused on build systems and modules these
> days.
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Received on 2022-05-22 21:11:37