Date: Sun, 22 May 2022 08:24:32 +0200
On 20/05/2022 18.34, Tom Honermann via SG16 wrote:
> * L2/22-072R: Proposal for amendments to UAX#9 and UAX#31 <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf>
> o Review for familiarity and relevance to P1949: C++ Identifier Syntax using Unicode Standard Annex 31 <https://wg21.link/p1949>.
>
> L2/22-072R <https://www.unicode.org/L2/L2022/22072r-uax9-uax31-amd.pdf> was produced by the Unicode Source Code Ad-Hoc Group and adopted in April into the proposed updates for Unicode 15 per the Draft Minutes of UTC Meeting 171 <https://www.unicode.org/L2/L2022/22061.htm#171-C25>. Thanks are owed to Robin Leroy (CC'd) for bringing this paper to our attention. The paper discusses handling of source code that contains characters that have right-to-left (RTL) directionality. The changes made to UAX#9 (Unicode Bidirectional Algorithm) <https://www.unicode.org/reports/tr9/proposed.html#HL4Example2> (in yellow highlight) are concerned with presentation of source code and are therefore more of a concern for SG15 (Tooling) where they would be applicable to compilers (e.g., in diagnostics), editors, code review tools, etc. The changes to UAX#31 (Unicode Identifier and Pattern Syntax) <https://www.unicode.org/reports/tr31/proposed.html#Pattern_Syntax> (in yellow highlight) clarify that
> rule UAX31-R3 <https://unicode.org/reports/tr31/#R3> is applicable to programming languages and present an example illustrating how use of LEFT-TO-RIGHT MARK (LRM) and RIGHT-TO-LEFT MARK (RLM) as whitespace characters (but not in isolation) may be desirable so that source code rendered as plain text does not present the source code in a confusing or surprising manner. The adopted changes suggest (at least) the following items for us to consider:
The most interesting shift in viewpoint that occurs with this update
is that UAX31-R3 is indeed intended to apply to the lexing of
programming language source code. The use of "pattern language"
without a mention of source code in the introductory paragraph was
not helpful. Mentally replacing "pattern language" with "lexing
rules" while reading UAX31-R3 helped quite a lot.
> 1. [uaxid.pattern]p2 <http://eel.is/c++draft/uaxid.pattern#2>, as added by P1949 <https://wg21.link/p1949>, states that UAX31-R3 <https://unicode.org/reports/tr31/#R3> is not applicable to C++ but in light of the updates above, that is not correct. The entry should be updated to state our conformance and possibly declare a profile for our use of Pattern_White_Space <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_White_Space%3A%5D&g=&i=> and Pattern_Syntax <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3APattern_Syntax%3A%5D&g=&i=> characters.
It seems we need to check that the C++ understanding of whitespace is
a subset of Pattern_White_Space. We also need to check that the C++
understanding of characters outside those in identifiers and whitespace
is a subset of Pattern_Syntax, and document the results in Annex E.
The existing note
"When meeting this requirement, all characters except those that have the
Pattern_White_Space or Pattern_Syntax properties are available for use as
identifiers or literals."
is likely not true for C++; there are most likely characters that cannot
be used in identifiers (because they're not in XID_Start or XID_Continue),
yet do not match Pattern_White_Space or Pattern_Syntax, either.
> 2. Per the example added to UAX31-R3 <https://unicode.org/reports/tr31/#R3>, consider allowing LRM and RLM to appear in whitespace (this would be an additional change to consider on top of P2348: Whitespaces Wording Revamp <https://wg21.link/p2348> after C++23 pending updated Unicode guidance).
I'm a bit worried about the phrasing "implicit directional marks".
Are those supposed to appear out of thin air at some point during
translation? Or is that just some descriptive term without deeper
meaning?
Is there a readable list of Pattern_White_Space characters somewhere?
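
For reference, a minimal sketch of what such a list and the whitespace
half of the subset check could look like. The 11 code points are the
Pattern_White_Space set as I understand it from the Unicode PropList.txt
data file. The cpp_whitespace table and the function names are
hypothetical; the table reflects my assumption about which characters
C++ lexing treats as whitespace (space, horizontal tab, new-line,
vertical tab, form feed), not wording taken from P2348.

```cpp
#include <array>
#include <cstddef>

// Pattern_White_Space code points (believed to match PropList.txt).
constexpr std::array<char32_t, 11> pattern_white_space{{
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D,  // TAB, LF, VT, FF, CR
    0x0020,                                  // SPACE
    0x0085,                                  // NEXT LINE (NEL)
    0x200E, 0x200F,                          // LRM, RLM
    0x2028, 0x2029                           // LINE / PARAGRAPH SEPARATOR
}};

// Assumption: the characters C++ lexing treats as whitespace.
constexpr std::array<char32_t, 5> cpp_whitespace{{
    0x0009, 0x000A, 0x000B, 0x000C, 0x0020
}};

// True if c has the Pattern_White_Space property (per the table above).
constexpr bool is_pattern_white_space(char32_t c) {
    for (std::size_t i = 0; i < pattern_white_space.size(); ++i)
        if (pattern_white_space[i] == c) return true;
    return false;
}

// True if every assumed C++ whitespace character is Pattern_White_Space.
constexpr bool cpp_whitespace_is_subset() {
    for (std::size_t i = 0; i < cpp_whitespace.size(); ++i)
        if (!is_pattern_white_space(cpp_whitespace[i])) return false;
    return true;
}

// Checked at compile time; requires C++14 or later.
static_assert(cpp_whitespace_is_subset(),
              "assumed C++ whitespace is not within Pattern_White_Space");
```

Under these assumptions the subset relation holds, and LRM (U+200E) and
RLM (U+200F) are the Pattern_White_Space members relevant to item 2.
The corresponding check against Pattern_Syntax would need the full
property data and is not attempted here.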
> 3. Consider proposing recommended display behaviors to SG15; presumably in line with HL4 from UAX#9 section 4.3, "Higher-Level Protocols" <https://unicode.org/reports/tr9/#Higher-Level_Protocols>. My understanding is that Microsoft Visual Studio implements this behavior. Opportunities for diagnostic improvements can be seen at https://godbolt.org/z/MM1xE5dM1 (note that the caret position is not aligned with the identifier it intends to highlight; this is because the code display and caret location are not in sync with regard to how RTL characters affect presentation).
I'm not sure SG15 has the right people to address this; they
seem to be mostly focused on build systems and modules these
days.
Jens
Received on 2022-05-22 06:24:39