Date: Fri, 20 May 2022 23:52:20 -0400
On Fri, May 20, 2022 at 1:18 PM Tom Honermann via SG16
<sg16_at_[hidden]> wrote:
>
> On 5/20/22 12:34 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, May 25th, at 19:30 UTC (timezone conversion).
>
> The agenda is:
>
> D2572R0: std::format() fill character allowances
>
> Continue review pending the availability of an updated revision.
>
> L2/22-072R: Proposal for amendments to UAX#9 and UAX#31
>
> Review for familiarity and relevance to P1949: C++ Identifier Syntax using Unicode Standard Annex 31.
>
> L2/22-072R was produced by the Unicode Source Code Ad-Hoc Group and adopted in April into the proposed updates for Unicode 15 per the Draft Minutes of UTC Meeting 171. Thanks are owed to Robin Leroy (CC'd) for bringing this paper to our attention. The paper discusses handling of source code that contains characters that have right-to-left (RTL) directionality. The changes made to UAX#9 (Unicode Bidirectional Algorithm) (in yellow highlight) are concerned with presentation of source code and are therefore more of a concern for SG15 (Tooling), where they would be applicable to compilers (e.g., in diagnostics), editors, code review tools, etc. The changes to UAX#31 (Unicode Identifier and Pattern Syntax) (in yellow highlight) clarify that rule UAX31-R3 is applicable to programming languages and present an example illustrating how use of LEFT-TO-RIGHT MARK (LRM) and RIGHT-TO-LEFT MARK (RLM) as whitespace characters (but not in isolation) may be desirable so that source code rendered as plain text does not present the source code in a confusing or surprising manner. The adopted changes suggest (at least) the following items for us to consider:
>
> [uaxid.pattern]p2, as added by P1949, states that UAX31-R3 is not applicable to C++ but in light of the updates above, that is not correct. The entry should be updated to state our conformance and possibly declare a profile for our use of Pattern_White_Space and Pattern_Syntax characters.
> Per the example added to UAX31-R3, consider allowing LRM and RLM to appear in whitespace (this would be an additional change to consider on top of P2348: Whitespaces Wording Revamp after C++23 pending updated Unicode guidance).
I am not convinced that the "as all and only those characters" wording
is good English.
That leads to a question of whether "as all, and as the only,
characters" is meant.
We will need to define a profile to comply anyway (because
line-separating whitespace matters for preprocessing).
That makes me wonder whether the Unicode folks gave any thought to
Python and its semantically relevant indentation.
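
To make the profile point concrete, here is a minimal sketch of the
Pattern_White_Space set as I understand it from PropList.txt, next to a
purely hypothetical profile predicate for "this whitespace ends a
line". The name ends_line_in_sketch_profile and the choice of LF and CR
only are my assumptions for illustration, not anything taken from the
C++ working draft or from UAX#31.

```cpp
// Pattern_White_Space per Unicode's PropList.txt: 11 code points.
// Note that LRM and RLM are members, which is what makes the
// "LRM/RLM as whitespace" suggestion possible.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEXT LINE
        || c == 0x200E                   // LEFT-TO-RIGHT MARK
        || c == 0x200F                   // RIGHT-TO-LEFT MARK
        || c == 0x2028                   // LINE SEPARATOR
        || c == 0x2029;                  // PARAGRAPH SEPARATOR
}

// HYPOTHETICAL profile rule (an assumption of this sketch): only LF and
// CR terminate a line; every other Pattern_White_Space character is
// treated as horizontal whitespace.
constexpr bool ends_line_in_sketch_profile(char32_t c) {
    return c == 0x000A || c == 0x000D;
}

static_assert(is_pattern_white_space(0x200E), "LRM is Pattern_White_Space");
static_assert(is_pattern_white_space(0x200F), "RLM is Pattern_White_Space");
static_assert(!ends_line_in_sketch_profile(0x200E), "LRM does not end a line here");
```

The split into two predicates is the whole point: a language in which
some whitespace is significant cannot treat the set uniformly, so some
profile seems unavoidable.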
This section still talks a lot about "pattern languages", and its use
of "literal" is meant as "literal character to be matched".
So, do all of the characters in
1.e+5
need to be considered Pattern_Syntax for the C++ profile? It certainly
seems plausible that all characters subject to restricted "lexical
structure" are meant to be Pattern_Syntax for the purposes of the
requirement.
Note: The characters not in the basic character set and not part of an
identifier won't need to be Pattern_Syntax under the profile. We error
on those outside of string/character literals.
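
For reference, a rough sketch of how the default property treats the
characters of 1.e+5. It covers only the ASCII portion of Pattern_Syntax
as I read PropList.txt; the helper names are made up for this message.

```cpp
#include <cstddef>

// ASCII subset of Pattern_Syntax: all ASCII punctuation and symbols
// except '_' (U+005F). Letters and digits are not members.
constexpr bool is_ascii_pattern_syntax(char c) {
    return (c >= 0x21 && c <= 0x2F)   // ! " # $ % & ' ( ) * + , - . /
        || (c >= 0x3A && c <= 0x40)   // : ; < = > ? @
        || (c >= 0x5B && c <= 0x5E)   // [ \ ] ^
        || c == 0x60                  // `
        || (c >= 0x7B && c <= 0x7E);  // { | } ~
}

// Counts the characters of a NUL-terminated ASCII token that have the
// default Pattern_Syntax property (hypothetical helper).
inline std::size_t count_pattern_syntax(const char* token) {
    std::size_t n = 0;
    for (; *token != '\0'; ++token) {
        if (is_ascii_pattern_syntax(*token)) {
            ++n;
        }
    }
    return n;
}

static_assert(is_ascii_pattern_syntax('.'), "'.' is Pattern_Syntax");
static_assert(is_ascii_pattern_syntax('+'), "'+' is Pattern_Syntax");
static_assert(!is_ascii_pattern_syntax('e'), "'e' is not Pattern_Syntax");
static_assert(!is_ascii_pattern_syntax('1'), "'1' is not Pattern_Syntax");
```

So under the default property only '.' and '+' in 1.e+5 are
Pattern_Syntax; the digits and the 'e' are not. If the requirement
really covers every character with lexical significance, a C++ profile
would have to say so explicitly.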
I don't think the update has been done smoothly. The requirement
itself still seems mainly motivated by pattern languages. More
description of how to apply it to other programming languages would be
an improvement to the document.
> Consider proposing recommended display behaviors to SG15; presumably in line with HL4 from UAX#9 section 4.3, "Higher-Level Protocols". My understanding is that Microsoft Visual Studio implements this behavior. Opportunities for diagnostic improvements can be seen at https://godbolt.org/z/MM1xE5dM1 (note that the caret position is not aligned with the identifier it intends to highlight; this is because the code display and caret location are not in sync with regard to how RTL characters affect presentation).
>
> With regard to these last two items, https://godbolt.org/z/vzo996Gnr demonstrates what current compilers do if an LRM is inserted after the undefined identifier. All three compilers reject the LRM, but its presence corrects the source code display such that the caret alignment works as intended.
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Received on 2022-05-21 03:52:50