[SG16] Draft slides for LWG tomorrow

From: Steve Downey <sdowney_at_[hidden]>
Date: Wed, 23 Sep 2020 13:25:32 -0400
https://github.com/steve-downey/papers/blob/ewg-presentation/UAX31-EWG-slides.org

C++ IDENTIFIERS USING UAX 31

STEVE DOWNEY

Created: 2020-09-23 Wed 13:18


TABLE OF CONTENTS

C++ Identifier Syntax using Unicode Standard Annex 31
The Emoji Problem
Script Issues
Other adopters
We have wording

C++ IDENTIFIER SYNTAX USING UNICODE STANDARD ANNEX 31

The proposal is that:

C++ identifiers match the pattern

(XID_Start + _ ) + XID_Continue*

Portable source is required to be normalized as NFC.
Use of unassigned code points is ill-formed.
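
As a rough illustration of the pattern above, here is a minimal C++17 sketch of a matcher. The names is_xid_start_subset, is_xid_continue_subset and is_identifier are hypothetical, and the two predicates cover only a tiny hand-picked subset of the real properties; an actual implementation would take XID_Start and XID_Continue from the Unicode Character Database.

```cpp
#include <string_view>

// Illustrative subset only (assumption): the real XID_Start and XID_Continue
// sets come from the Unicode Character Database (DerivedCoreProperties.txt).
constexpr bool in_range(char32_t c, char32_t lo, char32_t hi) {
    return lo <= c && c <= hi;
}

constexpr bool is_xid_start_subset(char32_t c) {
    return in_range(c, U'A', U'Z') || in_range(c, U'a', U'z')
        || in_range(c, 0x00C0, 0x00D6)   // Latin-1 letters
        || in_range(c, 0x00D8, 0x00F6)
        || in_range(c, 0x03B1, 0x03C9);  // Greek small letters alpha..omega
}

constexpr bool is_xid_continue_subset(char32_t c) {
    return is_xid_start_subset(c) || in_range(c, U'0', U'9') || c == U'_'
        || in_range(c, 0x0300, 0x036F);  // combining diacritical marks
}

// (XID_Start + _) followed by XID_Continue*
constexpr bool is_identifier(std::u32string_view s) {
    if (s.empty()) return false;
    if (s[0] != U'_' && !is_xid_start_subset(s[0])) return false;
    for (char32_t c : s.substr(1))
        if (!is_xid_continue_subset(c)) return false;
    return true;
}

static_assert(is_identifier(U"_count1"));
static_assert(is_identifier(U"\u03B1\u03B2"));  // two Greek letters
static_assert(!is_identifier(U"1st"));          // a digit cannot start
static_assert(!is_identifier(U"a\u200Bb"));     // ZERO WIDTH SPACE is rejected
```

The static_assert lines double as usage examples: the checks run at compile time, so the file only needs to compile.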

PROBLEM THIS FIXES : NL 029

Allowed characters include those from U+200b until U+206x; these are
zero-width and control characters that lead to impossible to type
names, indistinguishable names and unusable code & compile errors
(such as those accidentally including RTL modifiers).


OTHER "WEIRD IDENTIFIER CODE POINTS"

The middle dot · which looks like an operator.
Many non-combining "modifiers" and accent marks, such as ´ and ¨ and ꓻ
which don't really make sense on their own.
"Tone marks" from various languages, including ˫ (similar to a
box-drawing character ├ which is an operator).
The "Greek question mark" ; (see below)
Symbols which are simply not linguistic, such as ۞ and ༒.

https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59#weird-identifier-code-points


UAX 31 - UNICODE IDENTIFIER AND PATTERN SYNTAX

Follows the same principles as originally used for C++
Actively maintained
Stable

XID_START AND XID_CONTINUE

Unicode database defined properties
Closed under normalization for all four forms
Once a code point has the property it is never removed
Roughly:

Start == letters
Continue == Start + numbers + some punctuation
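
To see why normalization matters for identifiers, the sketch below spells the same word two ways. The names composed, decomposed and same_code_points are made up for illustration. Both spellings use only letters and a combining mark, so both fit the Start/Continue pattern, yet they are different code point sequences.

```cpp
#include <string_view>

// Two canonically equivalent spellings of "café".
constexpr std::u32string_view composed   = U"caf\u00E9";   // NFC: ends in U+00E9
constexpr std::u32string_view decomposed = U"cafe\u0301";  // NFD: U+0065 U+0301

// A compiler that compares identifiers code point by code point
// (an assumption about the implementation) sees two different names.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

static_assert(composed.size() == 4);
static_assert(decomposed.size() == 5);
static_assert(!same_code_points(composed, decomposed));
```

Requiring NFC picks one of these spellings, and closure of the properties under normalization means that normalizing a valid identifier cannot make it invalid.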


THE EMOJI PROBLEM

The emoji-like code points that we knew about were excluded
We included all unassigned code points
Emoji 'support' is an accident, incomplete, and broken

SOME EXAMPLES

int ⏰ = 0; // not valid
int 🕐 = 0; // valid

int ☠️ = 0; // not valid
int 💀 = 0; // valid

int ✋️ = 0; // not valid
int 👊 = 0; // valid

int ✈️ = 0; // not valid
int 🚀 = 0; // valid

int ☹️ = 0; // not valid
int 😀 = 0; // valid


♀ AND ♂ ARE DISALLOWED

// Valid
    bool 👷 = true; // Construction Worker
// Not valid
    bool 👷‍♀ = false; // Woman Construction Worker: {Construction Worker}{ZWJ}{Female Sign}
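
The valid and not valid outcomes on this slide and the previous one can be traced to where each code point falls relative to the allowed ranges of the current rules. The sketch below uses an excerpt of the C++17 [charname.allowed] table reproduced from memory, so the ranges are an assumption to check against the standard. The names Range, cpp17_allowed_excerpt, in_cpp17_excerpt and all_in_cpp17_excerpt are hypothetical.

```cpp
#include <string_view>

// Excerpt of the C++17 [charname.allowed] ranges, reproduced from memory
// (assumption: verify against the standard; the full table has more entries).
struct Range { char32_t lo, hi; };

constexpr Range cpp17_allowed_excerpt[] = {
    {0x200B, 0x200D},    // ZERO WIDTH SPACE, ZWNJ, ZWJ
    {0x202A, 0x202E},    // bidirectional embedding and override controls
    {0x2060, 0x206F},
    {0x2070, 0x218F},
    {0x2460, 0x24FF},
    {0x2776, 0x2793},
    {0x10000, 0x1FFFD},  // all of plane 1, assigned or not
};

// True if c lies in one of the excerpted ranges. A false result is only
// meaningful for the code points checked below.
constexpr bool in_cpp17_excerpt(char32_t c) {
    for (const Range& r : cpp17_allowed_excerpt)
        if (r.lo <= c && c <= r.hi) return true;
    return false;
}

constexpr bool all_in_cpp17_excerpt(std::u32string_view s) {
    for (char32_t c : s)
        if (!in_cpp17_excerpt(c)) return false;
    return true;
}

static_assert(!in_cpp17_excerpt(0x23F0));   // ALARM CLOCK
static_assert( in_cpp17_excerpt(0x1F550));  // CLOCK FACE ONE OCLOCK
static_assert(!in_cpp17_excerpt(0x2620));   // SKULL AND CROSSBONES
static_assert( in_cpp17_excerpt(0x1F480));  // SKULL
static_assert( in_cpp17_excerpt(0x1F477));  // CONSTRUCTION WORKER
static_assert( in_cpp17_excerpt(0x200D));   // ZERO WIDTH JOINER
static_assert(!in_cpp17_excerpt(0x2640));   // FEMALE SIGN
// {CONSTRUCTION WORKER}{ZWJ}{FEMALE SIGN} fails only because of U+2640.
static_assert(!all_in_cpp17_excerpt(U"\U0001F477\u200D\u2640"));
```

The split is by code point block rather than by any notion of "emoji": older symbols below U+2776 fall in a gap, while everything in plane 1 is accepted.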


EMOJI ARE NOT "STABLE" IN UNICODE

From the emoji spec:

isEmoji(♟)=false for Emoji Version 5.0, but true for Version 11.0.

It is possible that the emoji property could be removed.


SOME SURPRISING THINGS ARE EMOJI

002A ; Emoji # E0.0 [1] (*️) asterisk
0030..0039 ; Emoji # E0.0 [10] (0️..9️) digit zero..digit nine

   {DIGIT ONE}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} 1️⃣

   {ASTERISK}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} *️⃣
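
A small sketch of why this is surprising for identifiers. The rule naive_identifier_start below is hypothetical and is not something these slides propose: it shows what would follow from simply admitting every code point with the Emoji property. The predicate has_emoji_property_excerpt encodes only the two data lines quoted above.

```cpp
// Only the two emoji-data.txt entries quoted above; the real file lists
// many more code points.
constexpr bool has_emoji_property_excerpt(char32_t c) {
    return c == 0x002A || (0x0030 <= c && c <= 0x0039);
}

constexpr bool is_ascii_start(char32_t c) {
    return (U'a' <= c && c <= U'z') || (U'A' <= c && c <= U'Z') || c == U'_';
}

// Hypothetical rule (assumption, for illustration): also accept any code
// point with the Emoji property as the start of an identifier.
constexpr bool naive_identifier_start(char32_t c) {
    return is_ascii_start(c) || has_emoji_property_excerpt(c);
}

// The naive rule would let an operator and a digit start an identifier.
static_assert(!is_ascii_start(U'*') && naive_identifier_start(U'*'));
static_assert(!is_ascii_start(U'7') && naive_identifier_start(U'7'));

// The keycap sequence from the slide: DIGIT ONE, VARIATION SELECTOR-16,
// COMBINING ENCLOSING KEYCAP.
constexpr char32_t keycap_one[] = {U'1', 0xFE0F, 0x20E3};
static_assert(naive_identifier_start(keycap_one[0]));
```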


FIXING THE EMOJI PROBLEM WOULD MEAN BEING INVENTIVE

Being inventive in an area outside our expertise is HARD

Adopting UAX31 as a base to move forward is conservative


SCRIPT ISSUES

Some scripts need display-control characters or punctuation that are
not in the identifier set.


THIS INCLUDES ENGLISH

Apostrophe and dash

Won't, Can't, Mustn't
Mother-in-law

Programmers are used to this and do not notice

ZWJ AND ZWNJ

Zero-width joiners (ZWJ) and zero-width non-joiners (ZWNJ) are used in some scripts

Farsi word "names"

   نامهای
   NOON + ALEF + MEEM + HEH + ALEF + FARSI YEH

Farsi word "a letter"

   نامه‌ای
   NOON + ALEF + MEEM + HEH + ZWNJ + ALEF + FARSI YEH

Anecdotally, these issues are understood and worked around
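
The two Farsi words can be written out as code point sequences to show what the workaround costs. U+200C (ZWNJ) is not in XID_Continue, so the second word cannot be spelled as an identifier under the plain pattern, and dropping the ZWNJ makes it the same sequence as the first. The names names, a_letter and equal_ignoring_zwnj are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

constexpr char32_t zwnj = 0x200C;  // ZERO WIDTH NON-JOINER

// NOON ALEF MEEM HEH ALEF FARSI-YEH
constexpr std::u32string_view names =
    U"\u0646\u0627\u0645\u0647\u0627\u06CC";
// NOON ALEF MEEM HEH ZWNJ ALEF FARSI-YEH
constexpr std::u32string_view a_letter =
    U"\u0646\u0627\u0645\u0647\u200C\u0627\u06CC";

// Compare two sequences while skipping every ZWNJ in either one.
constexpr bool equal_ignoring_zwnj(std::u32string_view a,
                                   std::u32string_view b) {
    std::size_t i = 0, j = 0;
    while (true) {
        while (i < a.size() && a[i] == zwnj) ++i;
        while (j < b.size() && b[j] == zwnj) ++j;
        if (i == a.size() || j == b.size())
            return i == a.size() && j == b.size();
        if (a[i] != b[j]) return false;
        ++i;
        ++j;
    }
}

static_assert(names != a_letter);                    // distinct words
static_assert(equal_ignoring_zwnj(names, a_letter)); // same without ZWNJ
```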


OTHER ADOPTERS

Java https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html#jls-3.8
Python 3 https://www.python.org/dev/peps/pep-3131/
Erlang https://www.erlang.org/erlang-enhancement-proposals/eep-0040.html
Rust https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html
JS https://tc39.es/ecma262/

WE HAVE WORDING

Core change

identifier:
identifier-start
identifier identifier-continue

Removed from identifier: identifier-nondigit, identifier identifier-nondigit, identifier digit

identifier-start:
nondigit
universal-character-name of class XID_Start

identifier-continue:
digit
nondigit
universal-character-name of class XID_Continue

Received on 2020-09-23 12:25:49