sg16: Re: [SG16] [isocpp-ext] P1949R4 - C++ Identifier Syntax using Unicode Standard Annex 31

From: Matthew Woehlke <mwoehlke.floss_at_[hidden]>
Date: Thu, 18 Jun 2020 09:51:48 -0400

On 05/06/2020 16.35, Steve Downey via Ext wrote:
> Last week SG16 (Text) approved forwarding this paper to EWG for
> consideration. It addresses fixing the state of allowed identifiers in C++.
>
> https://isocpp.org/files/papers/P1949R4.html (also attached as d1949.html)
>
> Summary <https://isocpp.org/files/papers/D1949R4.html#summary>
>
> The allowed Unicode code points in identifiers include many that are
> unassigned or unnecessary, and others that are actually counter-productive.
> By adopting the recommendations of UAX #31, Unicode Identifier and Pattern
> Syntax, C++ will be easier to work with in international environments and
> less prone to accidental problems.
>
> This proposal does not address some potential security concerns—so called
> homoglyph attacks—where letters that appear the same may be treated as
> distinct. Methods of defense against such attacks are complex and evolving,
> and requiring mitigation strategies would impose substantial implementation
> burden.
>
> This proposal also recommends adoption of Unicode normalization form C
> (NFC) for identifiers to ensure that when compared, identifiers intended to
> be the same will compare as equal. Legacy encodings are generally naturally
> in NFC when converted to Unicode. Most tools will, by default, produce NFC
> text.
>
> Some unusual scripts require the use of characters as joiners that are not
> allowed by UAX #31, these will no longer be available as identifiers in C++.
>
> As a side-effect of adopting the identifier characters from UAX #31, using
> emoji in or as identifiers becomes ill-formed.
>
> See also
> https://unicode.org/reports/tr31/ Unicode® Standard Annex #31 UNICODE
> IDENTIFIER AND PATTERN SYNTAX

Okay... I have a potential strong objection to this. It is not clear to
me (not being a Unicode expert) how this will interact with the many,
many existing tools (not to mention programmer muscle memory) that
define identifiers as:

[_[:alpha:]][_[:alnum:]]*

In particular, I note that allowed characters include "nonspacing marks,
spacing combining marks", and "connector punctuation", which don't sound
like they would be matched by [[:alnum:]].
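
To make the concern concrete, here is a rough sketch in Python rather than C++. It rests on an assumption: `str.isalpha()` and `str.isalnum()` stand in for `[:alpha:]` and `[:alnum:]` in a Unicode-aware locale, which real tools may or may not match. `matches_classic_pattern` is a hypothetical helper, and `str.isidentifier()` (Python's own XID_Start/XID_Continue check) is used only as an approximation of the UAX #31 rule, not as the paper's exact definition.

```python
import unicodedata


def matches_classic_pattern(name: str) -> bool:
    """Approximate [_[:alpha:]][_[:alnum:]]* (hypothetical helper).

    Assumption: isalpha()/isalnum() play the role of the POSIX classes
    in a Unicode-aware locale.
    """
    if not name:
        return False
    first, rest = name[0], name[1:]
    if not (first == "_" or first.isalpha()):
        return False
    return all(ch == "_" or ch.isalnum() for ch in rest)


# One sample per character kind mentioned above, plus the Fraktur word.
SAMPLES = {
    "nonspacing mark": "e\u0301",              # e + COMBINING ACUTE ACCENT
    "spacing combining mark": "\u0915\u0903",  # DEVANAGARI KA + SIGN VISARGA
    "connector punctuation": "a\u203Fb",       # a + UNDERTIE + b
    "mathematical Fraktur": "𝕰𝖛𝖔𝖑𝖚𝖙𝖎𝖔𝖓",
}

if __name__ == "__main__":
    for label, name in SAMPLES.items():
        categories = [unicodedata.category(ch) for ch in name]
        print(
            f"{label}: categories={categories} "
            f"classic={matches_classic_pattern(name)} "
            f"xid={name.isidentifier()}"
        )
```

Under these assumptions, the first three samples are accepted by the XID-based check but rejected by the classic pattern, while the Fraktur word is accepted by both.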

I would very, ***VERY*** strongly like to see an analysis of whether
this change is going to break existing tools that rely on the above
definition of identifiers.

Note that I *do* expect "𝕰𝖛𝖔𝖑𝖚𝖙𝖎𝖔𝖓" to match the above
specification and will happily consider it a bug in the tool if it does not.

I will also happily argue that we should continue to disallow
punctuation in identifiers, even if notionally required by some scripts.
After all, we don't currently allow:

int can'tNameThis;

(That said, I would hope and expect compilers that are less strict will
continue to offer that as an option. It seems this would be required
even with the paper in its current state.)
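
For comparison, a small Python sketch of the apostrophe case. Again, `str.isidentifier()` is only an approximation of an XID-based rule, and `describe` is a hypothetical helper. The apostrophe is general category Po (other punctuation), not Pc (connector punctuation) like the underscore, so it is rejected either way.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX (category)' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} ({unicodedata.category(ch)})"


if __name__ == "__main__":
    print(describe("'"), describe("_"))
    # The apostrophe is not an identifier character under the XID-based check.
    print("can'tNameThis".isidentifier())
    print("cant_NameThis".isidentifier())
```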

-- 
Matthew

Received on 2020-06-18 08:55:00