ISOCPP sg16 List: Re: Agenda for the 2022-05-25 SG16 telecon

From: Robin Leroy <eggrobin_at_[hidden]>
Date: Sun, 22 May 2022 02:32:39 +0200

On Pattern_White_Space:

> Is all whitespace expected to be treated equivalently, or is that
> specifically not assumed?
>
This is specifically not assumed.

The new note in UAX31-R3
<https://www.unicode.org/reports/tr31/proposed.html#R3> makes that clear:
it explicitly says that you shouldn’t treat LRM (U+200E LEFT-TO-RIGHT MARK)
as equivalent to a space where your syntax requires spaces.
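
As a rough illustration of that distinction, here is a minimal Python
sketch. The eleven code points are the Pattern_White_Space set from the
UCD. The function names, and the split between "counts as a required space"
and "directional mark", are hypothetical and only one possible reading of
the note; they are not wording from UAX #31.

```python
# Pattern_White_Space is a small, stable set of 11 code points.
PATTERN_WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000D + 1)]  # TAB, LF, VT, FF, CR
    + [
        "\u0020",  # SPACE
        "\u0085",  # NEXT LINE
        "\u200e",  # LEFT-TO-RIGHT MARK (LRM)
        "\u200f",  # RIGHT-TO-LEFT MARK (RLM)
        "\u2028",  # LINE SEPARATOR
        "\u2029",  # PARAGRAPH SEPARATOR
    ]
)

# Assumption for this sketch: LRM and RLM only influence bidirectional
# display, so they are tolerated between tokens but never stand in for
# a space that the grammar requires.
DIRECTIONAL_MARKS = frozenset("\u200e\u200f")


def is_pattern_white_space(ch):
    """True if ch has the Pattern_White_Space property."""
    return ch in PATTERN_WHITE_SPACE


def satisfies_required_space(ch):
    """True if ch may fill a position where the syntax demands whitespace."""
    return ch in PATTERN_WHITE_SPACE and ch not in DIRECTIONAL_MARKS
```

With these definitions, LRM is Pattern_White_Space but does not satisfy a
required space, while U+0020 does both. A real language would also decide
separately how line terminators are treated.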

The new wording at the beginning of Section 4
<https://www.unicode.org/reports/tr31/proposed.html#Pattern_Syntax> just
says

> Most programming languages have a concept of whitespace as part of their
> lexical structure

Indeed https://eel.is/c++draft/lex#pptoken-2 mentions

> whitespace characters (U+0020 SPACE, U+0009 CHARACTER TABULATION,
> new-line, U+000B LINE TABULATION, and U+000C FORM FEED)

and
https://docs.python.org/3/reference/lexical_analysis.html?highlight=whitespace#other-tokens
mentions

> Whitespace characters (other than line terminators, discussed earlier)
>
so both languages appear to have such a concept which includes new lines,
even if some of that whitespace (and in particular new lines) is treated
specially.

> That makes me wonder if the Unicode folks gave thoughts to Python and its
> semantically-relevant indentation practices.

I am fairly certain that I had it in mind; the source code working group
also has someone from the Python community in regular attendance and
actively contributing to the discussions, so rest assured we won’t forget
about that language :-)

> The interplay between this and 'trojan source' is interesting.
>
[…]

> Your deliberately hostile and confusing security attack is now ill-formed
> is, I think, a reasonable sell.

The issue being mitigated by allowing LRM isn’t a security issue (the
author of the code still needs to go out of their way—possibly using an
automated tool—to make things render in a readable way, and nothing
currently forces an attacker to do so).
It is a usability issue: these characters make it possible to make
bidirectional code readable even in editors that don’t handle that
properly. *Editors that don’t handle that properly* is currently a
complicated way to say *anything that’s not Visual Studio*, but even once
other editors start implementing our recommendations, plenty of
language-unaware editors will remain—and people will still email and tweet
code around.
Consider:

> std::vector<מיאו> חתול; // Declares a variable חתול (cat) which is a
> container of מיאו (meow).

You don’t even need right-to-left identifiers:

> *return* u8"مواء"; // رسالة العنصر النائب

Both of those can be fixed with LRMs:

> std::vector<מיאו‎> חתול; // Declares a variable חתול (cat) which is a
> container of מיאו (meow).

*return* u8"مواء"‎; // رسالة العنصر النائب
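
To see why a single LRM is enough, it can help to look at the Bidi_Class of
the characters around the insertion point. The sketch below uses Python's
standard unicodedata module. The helper name bidi_classes and the shortened
strings are made up for this illustration, and the comments paraphrase how
the Unicode Bidirectional Algorithm resolves neutral characters rather than
implementing it.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def bidi_classes(text):
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Last letter of the first Hebrew identifier, '>', a space, and the first
# letter of the second Hebrew identifier.
without_lrm = "\u05d5> \u05d7"
with_lrm = "\u05d5" + LRM + "> \u05d7"

# '>' and ' ' are neutral. Sitting between two strong right-to-left
# letters, they resolve to right-to-left, so a plain-text display reorders
# them together with the Hebrew letters.
print(bidi_classes(without_lrm))  # ['R', 'ON', 'WS', 'R']

# With LRM (a strong left-to-right character) in front of '>', the
# neutrals sit between L and R and fall back to the paragraph direction,
# which is left-to-right for a line that starts with "std::".
print(bidi_classes(with_lrm))  # ['R', 'L', 'ON', 'WS', 'R']
```

The Arabic example works the same way, except that Arabic letters have
Bidi_Class AL rather than R.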

On Pattern_Syntax:

> I am not convinced that the "as all and only those characters" wording is
> good English.
> That leads to a question of whether "as all, and as the only, characters"
> is meant.

I agree that the wording about Pattern_Syntax is very messy.
We didn’t want to touch the normative text in that proposal, but we should
definitely clarify that in Unicode 16.0 (September 2023).
The words

> characters that are disallowed in identifiers but have syntactic use
>
were an attempt at a minimalistic clarification in the meantime.

I have checked with Mark Davis (the editor of UAX #31, chair of the source
code group, and a co-author on L2/22-072R); we both agree that the
interpretation of the Pattern_Syntax side of UAX31-R3, as it applies to
programming languages, is as follows:
1. none of the characters in Pattern_Syntax are allowed in identifiers, and
2. the characters outside of the union of Pattern_Syntax,
Pattern_White_Space and identifier characters aren’t given special
treatment in the lexical structure.
As a result, C++ would not need a profile on top of Pattern_Syntax.
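
The two points above can be pictured as a partition of code points into
four classes. The following Python sketch is an illustration under stated
assumptions, not anything taken from UAX #31 or L2/22-072R:
ASCII_PATTERN_SYNTAX is only the ASCII part of Pattern_Syntax (the full
property covers many more code points), str.isidentifier stands in for
"identifier characters", and classify is a hypothetical name.

```python
import string

# ASCII subset of Pattern_Syntax: every ASCII punctuation or symbol
# character except '_', which is an identifier character.
ASCII_PATTERN_SYNTAX = frozenset(string.punctuation) - {"_"}

PATTERN_WHITE_SPACE = frozenset(
    "\u0009\u000a\u000b\u000c\u000d\u0020\u0085\u200e\u200f\u2028\u2029"
)


def classify(ch):
    """Sort one character into the classes discussed above.

    Non-ASCII Pattern_Syntax characters are not covered by this sketch
    and would be reported as "other".
    """
    if ch in PATTERN_WHITE_SPACE:
        return "whitespace"
    if ch in ASCII_PATTERN_SYNTAX:
        return "syntax"
    # Assumption: Python's own identifier rules approximate the
    # identifier characters of the language being described.
    if ("a" + ch).isidentifier():
        return "identifier"
    # Point 2: no special role in the lexical structure.
    return "other"


def syntax_is_disjoint_from_identifiers():
    """Point 1: no syntax character may continue an identifier."""
    return not any(("a" + ch).isidentifier() for ch in ASCII_PATTERN_SYNTAX)
```

A language with user-defined operators would draw its operator characters
from the "syntax" class only, which is where the requirement starts to
constrain the design.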

This admittedly looks like a fairly trivial requirement for C++.
However, we have other programming languages in mind as well; in languages
that allow user-defined operators, this requirement is nontrivial (it
limits what the language would allow in custom operators).
I believe the characters allowed in user-defined operators in Swift are
Pattern_Syntax, so that Swift satisfies this side of R3 (but not the
whitespace requirement).

On the broader remarks about UAX31-R3 being a bit lost in a section that
talks a lot about pattern syntaxes for historical reasons:
To quote the rationale of L2/22-072R:

> UAX#31 defines requirement UAX31-R3 and the usage of Pattern_White_Space
> as whitespace in the context of “patterns that are a mixture of literal
> characters, whitespace, and syntax characters”, but, while general
> programming languages were not the focus when that was defined, the intent
> was not to limit its applicability so strictly. This is evidenced by the
> existing note in UAX31-R3, which refers to identifiers: those are not
> literals, whitespace, nor syntax.

However, the result indeed reads poorly, as we now end up with an extensive
note and example illustrating programming language syntaxes in the middle
of a section that mostly talks about patterns.
It has become clear that UAX #31 is in need of significant editorial work;
I think we will be doing quite a bit of that for Unicode 16.0.
I should note however that interactions between UAX31-R3 and the
non-pattern parts of the annex are not new: UAX31-R2 (which inspired
C++’s earlier definition of identifiers) specifically excludes
Pattern_Syntax and Pattern_White_Space from its character set so that it
interacts well with UAX31-R3 (that is what the first note in UAX31-R3 was
about).
I should also note that the application of UAX31-R3 to programming
languages has precedent: Rust claims conformance
<https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html#conformance-statement>
to UAX31-R3; so noting this broader applicability is also standardizing
existing practice (see also the remark on Swift operators above).

On Sat, 21 May 2022 at 23:40, Steve Downey <sdowney_at_[hidden]> wrote:

> I think the impedance mismatch is that although lex and parse are pattern
> languages, for C++ they are not *Unicode* pattern languages. Except for
> identifiers we restrict the allowed characters to a subset of the modern
> portable characters.
>
> Given this direction from the Unicode Consortium, we might want to update
> our conformance note to reflect that our lexing rules are not based in 31,
> without changing any normative wording.
>
> I haven't thought or read deeply about the whitespace changes. The
> interplay between this and 'trojan source' is interesting. However, I'm
> also confident that if we don't do anything now, we can fix it later, even
> if that means technically breaking conforming code. Your deliberately
> hostile and confusing security attack is now ill-formed is, I think, a
> reasonable sell.
>
> On Sat, May 21, 2022, 13:11 Hubert Tong via SG16 <sg16_at_[hidden]>
> wrote:
>
>> TL;DR: The motivation for the change is focused on guidance around
>> whitespace; the original intent of the requirement, and indeed the
>> whole document, is around stability of the interpretation of
>> identifiers and pattern-based syntax. The specific guidance about
>> whitespace that is being added does not fall within the scope
>> described by the summary of the document. The applicability of the
>> formulation with Pattern_Syntax to non-pattern-based languages is not
>> established. The assumed nature of "whitespace" is also not clear.
>>
>> On Fri, May 20, 2022 at 11:52 PM Hubert Tong
>> <hubert.reinterpretcast_at_[hidden]> wrote:
>> >
>>
>> > Note: The characters not in the basic character set and not part of an
>> > identifier won't need to be Pattern_Syntax under the profile. We error
>> > on those outside of string/character literals.
>>
>> Hmm. I am not sure for such characters what having them as
>> Pattern_Syntax or not implies.
>>
>> I'm pretty sure we don't need them to be Pattern_Syntax in the
>> profile, but having the ones that are Pattern_Syntax in the UCD remain
>> Pattern_Syntax is probably right too.
>>
>> We treat them all the same anyway, and we already retain the ability
>> to use them for whatever purpose outside of literals in the future
>> without breaking currently-conforming programs.
>>
>> It's not like we automatically accept the set of characters in neither
>> Pattern_Syntax nor Pattern_White_Space (unlike the assumed behaviour
>> for pattern languages).
>>
>> We already succeed in the original intent of the requirement insofar
>> as character sets are concerned: Currently-conforming programs would
>> not change meaning because of future choices to use new characters for
>> syntactic purposes outside of literals.
>>
>> I do not believe the intent of the new change (to apply some
>> whitespace-related guidance to general programming languages) falls
>> within the intent of the original requirement (to keep valid patterns
>> stable). It may be more appropriate to formulate a new, separate
>> requirement to implement the intent of the new change (which is mostly
>> restricted to Pattern_White_Space concerns and does not need to bring
>> Pattern_Syntax into it).
>>
>> The document should also make its assumptions about "whitespace"
>> clear: Is all whitespace expected to be treated equivalently, or is
>> that specifically not assumed?
>>
>> Finally, the update to the document does not make appropriate updates
>> to reflect expansions to its scope (at the document level in its title
>> and summary, and in the heading given for the "Pattern Syntax"
>> section).

Received on 2022-05-22 00:32:52