sg16: Re: [SG16] [isocpp-ext] P1949R4 - C++ Identifier Syntax using Unicode Standard Annex 31

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 18 Jun 2020 10:03:49 -0400

On Thu, Jun 18, 2020 at 9:52 AM Matthew Woehlke via SG16 <
sg16_at_[hidden]> wrote:

> On 05/06/2020 16.35, Steve Downey via Ext wrote:
> > Last week SG16 (Text) approved forwarding this paper to EWG for
> > consideration. It addresses fixing the state of allowed identifiers in
> C++.
> >
> > https://isocpp.org/files/papers/P1949R4.html (also attached as
> d1949.html)
> >
> > Summary <https://isocpp.org/files/papers/D1949R4.html#summary>
> >
> > The allowed Unicode code points in identifiers include many that are
> > unassigned or unnecessary, and others that are actually
> counter-productive.
> > By adopting the recommendations of UAX #31, Unicode Identifier and
> Pattern
> > Syntax, C++ will be easier to work with in international environments and
> > less prone to accidental problems.
> >
> > This proposal does not address some potential security concerns—so called
> > homoglyph attacks—where letters that appear the same may be treated as
> > distinct. Methods of defense against such attacks are complex and
> evolving,
> > and requiring mitigation strategies would impose substantial
> implementation
> > burden.
> >
> > This proposal also recommends adoption of Unicode normalization form C
> > (NFC) for identifiers to ensure that when compared, identifiers intended
> to
> > be the same will compare as equal. Legacy encodings are generally
> naturally
> > in NFC when converted to Unicode. Most tools will, by default, produce
> NFC
> > text.
> >
> > Some unusual scripts require the use of characters as joiners that are
> not
> > allowed by UAX #31, these will no longer be available as identifiers in
> C++.
> >
> > As a side-effect of adopting the identifier characters from UAX #31,
> using
> > emoji in or as identifiers becomes ill-formed.
> >
> > See also
> > https://unicode.org/reports/tr31/ Unicode® Standard Annex #31 UNICODE
> > IDENTIFIER AND PATTERN SYNTAX
>
> Okay... I have a potential strong objection to this. It is not clear to
> me (not being a unicode expert) how this will interact with the many,
> many existing tools (not to mention programmer muscle memory) that
> define identifiers as:
>
> [_[:alpha:]][_[:alnum:]]*
>
> In particular, I note that allowed characters include "nonspacing marks,
> spacing combining marks", and "connector punctuation", which don't sound
> like they would be matched by [[:alnum:]].
>
It sounds like the objection is with respect to the status quo.

In Table 2:
FDF0-FE44

That includes U+FE33:
FE33;PRESENTATION FORM FOR VERTICAL LOW LINE;*Pc*;0;ON;<vertical> 005F;;;;N;GLYPH FOR VERTICAL SPACING UNDERSCORE;;;;

*Pc* Connector_Punctuation a connecting punctuation mark, like a tie
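
For anyone who wants to check the point above locally, here is a minimal sketch in Python (chosen only because its standard library ships the Unicode Character Database; it is not part of the paper or of any compiler). It looks up U+FE33 and tests it against an ASCII-only identifier pattern. The names `ASCII_IDENTIFIER`, `TABLE2_RANGE`, and `describe` are made up for illustration, and the character classes are written out as `A-Za-z0-9` on the assumption that the quoted `[:alpha:]`/`[:alnum:]` pattern is evaluated in the "C" locale.

```python
import re
import unicodedata

# ASCII-only approximation of the quoted pattern [_[:alpha:]][_[:alnum:]]*.
# Python's re module has no POSIX classes, so they are spelled out here
# (assumption: the pattern is evaluated in the "C" locale).
ASCII_IDENTIFIER = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

# The range cited from Table 2 in this message (hypothetical helper name).
TABLE2_RANGE = range(0xFDF0, 0xFE44 + 1)


def describe(ch):
    """Return the Unicode Character Database fields relevant to this thread."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        "in_table2_range": ord(ch) in TABLE2_RANGE,
    }


vertical_low_line = "\uFE33"
info = describe(vertical_low_line)

# An identifier containing U+FE33 is not matched by the ASCII-only pattern,
# even though the code point falls inside the currently allowed range.
candidate = "foo" + vertical_low_line + "bar"
matches_ascii = ASCII_IDENTIFIER.fullmatch(candidate) is not None

print(info)
print("matched by ASCII pattern:", matches_ascii)
```

The lookup reports general category Pc for a code point inside FDF0-FE44, while the ASCII pattern rejects an identifier containing it, which is the sense in which the mismatch with such tools already exists today.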

Received on 2020-06-18 09:07:18