Subject: Re: [isocpp-ext] P1949R4 - C++ Identifier Syntax using Unicode Standard Annex 31
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2020-06-18 09:03:49
On Thu, Jun 18, 2020 at 9:52 AM Matthew Woehlke via SG16 <
> On 05/06/2020 16.35, Steve Downey via Ext wrote:
> > Last week SG16 (Text) approved forwarding this paper to EWG for
> > consideration. It addresses fixing the state of allowed identifiers in
> > https://isocpp.org/files/papers/P1949R4.html (also attached as
> > Summary <https://isocpp.org/files/papers/D1949R4.html#summary>
> > The allowed Unicode code points in identifiers include many that are
> > unassigned or unnecessary, and others that are actually
> > By adopting the recommendations of UAX #31, Unicode Identifier and
> > Syntax, C++ will be easier to work with in international environments and
> > less prone to accidental problems.
> > This proposal does not address some potential security concerns--so called
> > homoglyph attacks--where letters that appear the same may be treated as
> > distinct. Methods of defense against such attacks are complex and
> > and requiring mitigation strategies would impose substantial
> > burden.
> > This proposal also recommends adoption of Unicode normalization form C
> > (NFC) for identifiers to ensure that when compared, identifiers intended
> > be the same will compare as equal. Legacy encodings are generally
> > in NFC when converted to Unicode. Most tools will, by default, produce
> > text.
> > Some unusual scripts require the use of characters as joiners that are
> > allowed by UAX #31, these will no longer be available as identifiers in
> > As a side-effect of adopting the identifier characters from UAX #31,
> > emoji in or as identifiers becomes ill-formed.
> > See also
> > https://unicode.org/reports/tr31/ Unicode® Standard Annex #31 UNICODE
> > IDENTIFIER AND PATTERN SYNTAX
> Okay... I have a potential strong objection to this. It is not clear to
> me (not being a unicode expert) how this will interact with the many,
> many existing tools (not to mention programmer muscle memory) that
> define identifiers as:
> In particular, I note that allowed characters include "nonspacing marks,
> spacing combining marks", and "connector punctuation", which don't sound
> like they would be matched by [[:alnum:]].
It sounds like the objection is with respect to the status quo.
In Table 2:
That includes U+FE33:
FE33;PRESENTATION FORM FOR VERTICAL LOW LINE;*Pc*;0;ON;<vertical> 005F;;;;N;GLYPH FOR VERTICAL SPACING UNDERSCORE;;;;
*Pc* Connector_Punctuation a connecting punctuation mark, like a tie
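
For anyone who wants to check the U+FE33 data themselves, here is a minimal sketch using Python's standard unicodedata module. The ASCII_IDENT pattern is a hypothetical stand-in for the kind of ASCII-only identifier rule many tools use; it is not taken from this thread or from any particular tool.

```python
import re
import unicodedata

# Hypothetical ASCII-only identifier rule, used here only for illustration.
ASCII_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*", re.ASCII)

def describe(ch):
    """Return (general category, name, decomposition) for one character."""
    return (
        unicodedata.category(ch),
        unicodedata.name(ch),
        unicodedata.decomposition(ch),
    )

LOW_LINE = "\u005F"           # the ordinary ASCII underscore
VERTICAL_LOW_LINE = "\uFE33"  # the code point quoted above

if __name__ == "__main__":
    # Both characters carry general category Pc (Connector_Punctuation).
    print(describe(LOW_LINE))
    print(describe(VERTICAL_LOW_LINE))

    # The ASCII-only rule accepts the first spelling and rejects the second.
    for name in ("foo_bar", "foo" + VERTICAL_LOW_LINE + "bar"):
        print(ascii(name), ASCII_IDENT.fullmatch(name) is not None)
```

The point of the sketch is only that a Pc character outside ASCII already falls outside such a pattern, which is why the mismatch with ASCII-only tooling is not new with this paper.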