sg16: [SG16] P1949R4 - C++ Identifier Syntax using Unicode Standard Annex 31

From: Steve Downey <sdowney_at_[hidden]>
Date: Fri, 5 Jun 2020 16:35:31 -0400

Last week SG16 (Text) approved forwarding this paper to EWG for
consideration. It aims to fix the set of characters allowed in identifiers
in C++.

https://isocpp.org/files/papers/P1949R4.html (also attached as d1949.html)

Summary <https://isocpp.org/files/papers/D1949R4.html#summary>

The allowed Unicode code points in identifiers include many that are
unassigned or unnecessary, and others that are actually counter-productive.
By adopting the recommendations of UAX #31, Unicode Identifier and Pattern
Syntax, C++ will be easier to work with in international environments and
less prone to accidental problems.

This proposal does not address some potential security concerns (so-called
homoglyph attacks) where letters that appear the same may be treated as
distinct. Methods of defense against such attacks are complex and evolving,
and requiring mitigation strategies would impose a substantial
implementation burden.
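
To illustrate what a homoglyph pair looks like at the code point level,
here is a small sketch in Python (not taken from the paper; the variable
and function names are made up for the example). It assumes only the
standard unicodedata module. The Latin and Cyrillic letters render alike
in most fonts, yet they are distinct code points, and NFC normalization
does not unify them:

```python
import unicodedata

# Two letters that look alike in most fonts (illustrative choice).
latin = "a"          # U+0061
cyrillic = "\u0430"  # U+0430

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# NFC does not map one onto the other, so identifiers built from them
# remain distinct even after normalization.
same_after_nfc = (
    unicodedata.normalize("NFC", latin)
    == unicodedata.normalize("NFC", cyrillic)
)

print(describe(latin))
print(describe(cyrillic))
print("equal after NFC:", same_after_nfc)
```

This is why normalization alone is not a defense against such attacks.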

This proposal also recommends adoption of Unicode normalization form C
(NFC) for identifiers to ensure that when compared, identifiers intended to
be the same will compare as equal. Text in legacy encodings is generally
already in NFC when converted to Unicode, and most tools produce NFC text
by default.
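
As a rough illustration of the normalization point (a sketch, not part of
the paper), the Python snippet below spells the same visible identifier
two ways: once with a precomposed character and once with a base letter
plus a combining mark. The names composed, decomposed and is_nfc are
hypothetical; unicodedata.normalize is the standard library call.

```python
import unicodedata

# The same visible name, spelled with different code point sequences.
composed = "caf\u00e9"     # 'e with acute' as the single code point U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def is_nfc(s):
    """Return True if s is already in normalization form C."""
    return unicodedata.normalize("NFC", s) == s

# Compared code point by code point, the two spellings differ.
print(composed == decomposed)

# After NFC normalization they are identical.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Only the precomposed spelling is already in NFC.
print(is_nfc(composed), is_nfc(decomposed))
```

A check along the lines of is_nfc is one way a tool could test whether an
identifier is already in NFC, so that the two spellings above cannot end
up naming different entities.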

Some unusual scripts require the use of characters as joiners that are not
allowed by UAX #31; these will no longer be available in identifiers in C++.

As a side-effect of adopting the identifier characters from UAX #31, using
emoji in or as identifiers becomes ill-formed.
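
For a quick feel of what a UAX #31 style check accepts and rejects, here
is a sketch using Python's str.isidentifier() as a stand-in. This is an
assumption for illustration only: Python's identifier rules are also
based on the XID_Start and XID_Continue properties, but they are not the
C++ rules and may differ in detail (Python also has its own handling of
underscore and normalization). The sample strings are made up.

```python
# Sample candidate identifiers (illustrative only).
candidates = {
    "size": "plain ASCII",
    "gr\u00f6\u00dfe": "Latin letters outside ASCII",
    "\u540d\u524d": "CJK ideographs",
    "x\U0001F600": "letter followed by an emoji",
    "\U0001F600": "emoji alone",
    "2x": "starts with a digit",
}

def accepted(name):
    """Stand-in check: Python's XID-based identifier test."""
    return name.isidentifier()

for name, note in candidates.items():
    print(f"{note}: {'accepted' if accepted(name) else 'rejected'}")
```

Under this stand-in, the letter-based names pass, while the names
containing U+1F600 are rejected, because that code point has neither the
XID_Start nor the XID_Continue property.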

See also
https://unicode.org/reports/tr31/ Unicode® Standard Annex #31 UNICODE
IDENTIFIER AND PATTERN SYNTAX

Received on 2020-06-05 15:38:56