Re: Agenda for the 2022-05-25 SG16 telecon

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 22 May 2022 00:06:13 -0400

Hi Robin,

Thank you for the response and clarifications. Some responses in-line below.

-- HT

On Sat, May 21, 2022 at 8:32 PM Robin Leroy <eggrobin_at_[hidden]> wrote:
>
> On Pattern_White_Space:
>
>> Is all whitespace expected to be treated equivalently, or is that specifically not assumed?
>
> This is specifically not assumed.
>
> The new note in UAX31-R3 clarifies that, since it explicitly says that you shouldn’t treat LRM equivalently to space where your syntax requires spaces.

Got it. Thanks.

>
> The new wording at the beginning of Section 4 just says
>>
>> Most programming languages have a concept of whitespace as part of their lexical structure
>
>
> Indeed https://eel.is/c++draft/lex#pptoken-2 mentions
>>
>> whitespace characters (U+0020 SPACE, U+0009 CHARACTER TABULATION, new-line, U+000B LINE TABULATION, and U+000C FORM FEED)
>
> and https://docs.python.org/3/reference/lexical_analysis.html?highlight=whitespace#other-tokens mentions
>>
>> Whitespace characters (other than line terminators, discussed earlier)
>
> so both languages appear to have such a concept which includes new lines, even if some of that whitespace (and in particular new lines) is treated specially.
>
>> That makes me wonder if the Unicode folks gave thoughts to Python and its semantically-relevant indentation practices.
>
> I am fairly certain that I had it in mind; the source code working group also has someone from the Python community in regular attendance and actively contributing to the discussions, so rest assured we won’t forget about that language :-)

That's good to know.

>
>> The interplay between this and 'trojan source' is interesting.
>>
>> […]
>>
>> Your deliberately hostile and confusing security attack is now ill-formed is, I think, a reasonable sell.
>
> The issue being mitigated by allowing LRM isn’t a security issue (the author of the code still needs to go out of their way—possibly using an automated tool—to make things render in a readable way, and nothing currently forces an attacker to do so).
> It is a usability issue: these characters make it possible to make bidirectional code readable even in editors that don’t handle that properly. Editors that don’t handle that properly is currently a complicated way to say anything that’s not Visual Studio, but even once other editors start implementing our recommendations, plenty of language-unaware editors will remain—and people will still email and tweet code around.
> Consider:
>>
>> std::vector<מיאו> חתול; // Declares a variable חתול (cat) which is a container of מיאו (meow).
>
> You don’t even need right-to-left identifiers:
>>
>> return u8"مواء"; // رسالة العنصر النائب
>
>
> Both of those can be fixed with LRMs:
>>
>> std::vector<מיאו‎> חתול; // Declares a variable חתול (cat) which is a container of מיאו (meow).
>>
>> return u8"مواء"‎; // رسالة العنصر النائب
>
>
> On Pattern_Syntax:
>
>> I am not convinced that the "as all and only those characters" wording is good English.
>> That leads to a question of whether "as all, and as the only, characters" is meant.
>
> I agree that the wording about Pattern_Syntax is very messy.
> We didn’t want to touch the normative text in that proposal, but we should definitely clarify that in Unicode 16.0 (September 2023).

Looking forward to that; thanks.

> The words
>>
>> characters that are disallowed in identifiers but have syntactic use
>
> were an attempt at a minimalistic clarification in the meantime.
>
> I have checked with Mark Davis (the editor of #31, chair of the source code group, and a co-author on L2/22-072R); we both agree that the interpretation of the Pattern_Syntax side of UAX31-R3 as it applies to programming languages is as follows:
> 1. none of the characters in Pattern_Syntax are allowed in identifiers, and

And it is understood that characters allowed in identifiers may also
be used within the syntax of non-identifiers. (Just confirming.)

> 2. the characters outside of the union of Pattern_Syntax, Pattern_White_Space and identifier characters aren’t given special treatment in the lexical structure.

Does this include the "non-special" treatment of prohibiting such
"outside" characters except within appropriately quoted or comment
contexts?

> As a result, C++ would not need a profile on top of Pattern_Syntax.
>
> This admittedly looks like a fairly trivial requirement for C++.
> However, we have other programming languages in mind as well; in languages that allow user-defined operators, this requirement is nontrivial (it limits what the language would allow in custom operators).
> I believe the characters allowed in user-defined operators in Swift are Pattern_Syntax, so that Swift satisfies this side of R3 (but not the whitespace requirement).
>
> On the broader remarks about UAX31-R3 being a bit lost in a section that talks a lot about pattern syntaxes for historical reasons:
> To quote the rationale of L2/22-072R:
>>
>> UAX#31 defines requirement UAX31-R3 and the usage of Pattern_White_Space as whitespace in the context of “patterns that are a mixture of literal characters, whitespace, and syntax characters”, but, while general programming languages were not the focus when that was defined, the intent was not to limit its applicability so strictly. This is evidenced by the existing note in UAX31-R3, which refers to identifiers: those are not literals, whitespace, nor syntax.
>
> However, the result indeed reads poorly, as we now end up with an extensive note and example illustrating programming language syntaxes in the middle of a section that mostly talks about patterns.
> It has become clear that UAX #31 is in need of significant editorial work; I think we will be doing quite a bit of that for Unicode 16.0.
> I should note however that interactions between UAX31-R3 and the non-pattern parts of the annex are not new: UAX31-R2 (which inspired C++’s earlier definition of identifiers) specifically excludes Pattern_Syntax and Pattern_White_Space from its character set so that it interacts well with UAX31-R3 (that is what the first note in UAX31-R3 was about).
> I should also note that the application of UAX31-R3 to programming languages has precedent: Rust claims conformance to UAX31-R3; so noting this broader applicability is also standardizing existing practice (see also the remark on Swift operators above).

The Rust conformance statement (which seems to be tied to the initial
proposal, https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html,
and not maintained "live") is that it "only uses characters from these
categories for whitespace and syntax". Given the "all and only those
characters" wording and its likely meaning, the conformance statement
does not seem to indicate actual conformance.

The clarification means that Rust's statement is correct for
Pattern_Syntax, but it would still be insufficient for
Pattern_White_Space.

The Rust RFC uses Unicode Version 10.0.0 and its list of whitespace
(https://doc.rust-lang.org/reference/whitespace.html) is, at the time
of writing, incomplete when compared against
https://www.unicode.org/Public/10.0.0/ucd/PropList.txt.

It also seems Rust may want to modify its specification of whitespace
so that sequences of whitespace consisting only of LRM and RLM cannot
be used as token separators.
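
As a rough illustration of that last point, here is a minimal C++17
sketch, not Rust's or any compiler's actual lexer. The function names
are hypothetical. The code point list is the Pattern_White_Space set
from PropList.txt; the policy that a run made up only of LRM and RLM
does not separate tokens is an assumption based on the new note in
UAX31-R3.

```cpp
#include <string_view>

// Pattern_White_Space as listed in the UCD's PropList.txt.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // TAB, LF, VT, FF, CR
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // NEXT LINE
        || c == 0x200E || c == 0x200F     // LRM, RLM
        || c == 0x2028 || c == 0x2029;    // LINE SEPARATOR, PARAGRAPH SEPARATOR
}

// LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
constexpr bool is_directional_mark(char32_t c) {
    return c == 0x200E || c == 0x200F;
}

// Hypothetical lexer policy (an assumption, see above): a run separates
// tokens only if every code point is Pattern_White_Space and at least one
// of them is something other than LRM or RLM.
constexpr bool separates_tokens(std::u32string_view run) {
    bool has_real_whitespace = false;
    for (char32_t c : run) {
        if (!is_pattern_white_space(c)) return false;
        if (!is_directional_mark(c)) has_real_whitespace = true;
    }
    return has_real_whitespace;
}

static_assert(separates_tokens(U" "));
static_assert(separates_tokens(U"\u200E "));        // LRM next to a space
static_assert(!separates_tokens(U"\u200E\u200F"));  // marks only
static_assert(!separates_tokens(U""));              // empty run
```

Under this policy LRM and RLM are still accepted wherever whitespace
is allowed; they just cannot be the only thing standing between two
tokens.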



>
> Le sam. 21 mai 2022 à 23:40, Steve Downey <sdowney_at_[hidden]> a écrit :
>>
>> I think the impedance mismatch is that although lex and parse are pattern languages, for C++ they are not *Unicode* pattern languages. Except for identifiers we restrict the allowed characters to a subset of the modern portable characters.
>>
>> Given this direction from the Unicode Consortium, we might want to update our conformance note to reflect that our lexing rules are not based in 31, without changing any normative wording.
>>
>> I haven't thought or read deeply about the whitespace changes. The interplay between this and 'trojan source' is interesting. However, I'm also confident that if we don't do anything now, we can fix it later, even if that means technically breaking conforming code. Your deliberately hostile and confusing security attack is now ill-formed is, I think, a reasonable sell.
>>
>> On Sat, May 21, 2022, 13:11 Hubert Tong via SG16 <sg16_at_[hidden]> wrote:
>>>
>>> TL;DR: The motivation for the change is focused on guidance around
>>> whitespace; the original intent of the requirement, and indeed the
>>> whole document, is around stability of the interpretation of
>>> identifiers and pattern-based syntax. The specific guidance about
>>> whitespace that is being added does not fall within the scope
>>> described by the summary of the document. The applicability of the
>>> formulation with Pattern_Syntax to non-pattern-based languages is not
>>> established. The assumed nature of "whitespace" is also not clear.
>>>
>>> On Fri, May 20, 2022 at 11:52 PM Hubert Tong
>>> <hubert.reinterpretcast_at_[hidden]> wrote:
>>> >
>>>
>>> > Note: The characters not in the basic character set and not part of an
>>> > identifier won't need to be Pattern_Syntax under the profile. We error
>>> > on those outside of string/character literals.
>>>
>>> Hmm. I am not sure for such characters what having them as
>>> Pattern_Syntax or not implies.
>>>
>>> I'm pretty sure we don't need them to be Pattern_Syntax in the
>>> profile, but having the ones that are Pattern_Syntax in the UCD remain
>>> Pattern_Syntax is probably right too.
>>>
>>> We treat them all the same anyway, and we already retain the ability
>>> to use them for whatever purpose outside of literals in the future
>>> without breaking currently-conforming programs.
>>>
>>> It's not like we automatically accept the set of characters in neither
>>> Pattern_Syntax nor Pattern_White_Space (unlike the assumed behaviour
>>> for pattern languages).
>>>
>>> We already succeed in the original intent of the requirement insofar
>>> as character sets are concerned: Currently-conforming programs would
>>> not change meaning because of future choices to use new characters for
>>> syntactic purposes outside of literals.
>>>
>>> I do not believe the intent of the new change (to apply some
>>> whitespace-related guidance to general programming languages) falls
>>> within the intent of the original requirement (to keep valid patterns
>>> stable). It may be more appropriate to formulate a new, separate
>>> requirement to implement the intent of the new change (which is mostly
>>> restricted to Pattern_White_Space concerns and does not need to bring
>>> Pattern_Syntax into it).
>>>
>>> The document should also make its assumptions about "whitespace"
>>> clear: Is all whitespace expected to be treated equivalently, or is
>>> that specifically not assumed?
>>>
>>> Finally, the update to the document does not make appropriate updates
>>> to reflect expansions to its scope (at the document level in its title
>>> and summary, and in the heading given for the "Pattern Syntax"
>>> section).

Received on 2022-05-22 04:06:43