Date: Wed, 23 Sep 2020 12:44:56 -0500
Looks good. Did you mean EWG, though?
Zach
On Wed, Sep 23, 2020 at 12:25 PM Steve Downey via SG16 <
sg16_at_[hidden]> wrote:
>
> https://github.com/steve-downey/papers/blob/ewg-presentation/UAX31-EWG-slides.org
>
> C++ IDENTIFIERS USING UAX 31
>
> STEVE DOWNEY
>
> Created: 2020-09-23 Wed 13:18
>
> 1
>
> TABLE OF CONTENTS
>
> C++ Identifier Syntax using Unicode Standard Annex 31
> The Emoji Problem
> Script Issues
> Other adopters
> We have wording
> 2
>
> C++ IDENTIFIER SYNTAX USING UNICODE STANDARD ANNEX 31
>
> That C++ identifiers match the pattern
>
> (XID_Start + _ ) + XID_Continue*.
>
> That portable source is required to be normalized as NFC.
> That using unassigned code points be ill-formed.
> 3
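Inline note: a minimal sketch of the pattern on this slide, to make the shape of the rule concrete. The two predicates below cover only a tiny hand-picked subset (ASCII letters, ASCII digits, underscore, and Greek small letters U+03B1..U+03C9) and are assumptions for illustration; a real implementation would generate them from the Unicode Character Database. The names is_xid_start, is_xid_continue, and is_identifier are hypothetical.

```cpp
#include <string_view>

// Illustrative subset only: ASCII letters plus Greek small letters
// U+03B1..U+03C9. The real XID_Start set comes from the Unicode
// Character Database.
constexpr bool is_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0x03B1 && c <= 0x03C9);
}

// Illustrative subset only: the start set above, ASCII digits and '_'.
constexpr bool is_xid_continue(char32_t c) {
    return is_xid_start(c) || (c >= U'0' && c <= U'9') || c == U'_';
}

// (XID_Start + _) + XID_Continue*
constexpr bool is_identifier(std::u32string_view s) {
    if (s.empty()) return false;
    if (!(is_xid_start(s[0]) || s[0] == U'_')) return false;
    for (char32_t c : s.substr(1)) {
        if (!is_xid_continue(c)) return false;
    }
    return true;
}
```

With these definitions, U"_x1" and U"\u03B1\u03B2" are accepted, while U"1x" and U"a\u200Bb" (which contains a ZERO WIDTH SPACE) are rejected.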
>
> PROBLEM THIS FIXES : NL 029
>
> Allowed characters include those from U+200b until U+206x; these are
> zero-width and control characters that lead to impossible to type
> names, indistinguishable names and unusable code & compile errors
> (such as those accidentally including RTL modifiers).
>
> 4
>
> OTHER "WEIRD IDENTIFIER CODE POINTS"
>
> The middle dot · which looks like an operator.
> Many non-combining "modifiers" and accent marks, such as ´ and ¨ and ꓻ
> which don't really make sense on their own.
> "Tone marks" from various languages, including ˫ (similar to a
> box-drawing character ├ which is an operator).
> The "Greek question mark" ; (see below)
> Symbols which are simply not linguistic, such as ۞ and ༒.
>
>
> https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59#weird-identifier-code-points
>
> 5
>
> UAX 31 - UNICODE IDENTIFIER AND PATTERN SYNTAX
>
> Follows the same principles as originally used for C++
> Actively maintained
> Stable
> 6
>
> XID_START AND XID_CONTINUE
>
> Unicode database defined properties
> Closed under normalization for all four forms
> Once a code point has the property it is never removed
> Roughly:
>
> Start == letters
> Continue == Start + numbers + some punctuation
>
> 7
>
> THE EMOJI PROBLEM
>
> The emoji-like code points that we knew about were excluded
> We included all unassigned code points
> Emoji 'support' is an accident, incomplete, and broken
> 8
>
> SOME EXAMPLES
>
> int ⏰ = 0; //not valid
> int 🕐 = 0; // valid
>
> int ☠️ = 0; //not valid
> int 💀 = 0; // valid
>
> int ✋️ = 0; //not valid
> int 👊 = 0; // valid
>
> int ✈️ = 0; //not valid
> int 🚀 = 0; // valid
>
> int ☹️ = 0; //not valid
> int 😀 = 0; // valid
>
> 9
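Inline note: the valid / not valid split above seems to follow directly from the code point ranges that C++11 and later allow in identifiers. A minimal sketch, assuming my recollection of that range list is right and copying only the entries near these examples; Range, kAllowed, and in_cxx11_ranges are hypothetical names.

```cpp
#include <array>

struct Range { char32_t lo, hi; };

// Subset of the currently allowed identifier ranges (assumed from the
// C++11 annex on universal character names), limited to the entries
// relevant to the slide's examples.
constexpr std::array<Range, 5> kAllowed{{
    {0x200B, 0x200D},    // zero-width space, ZWNJ, ZWJ
    {0x2070, 0x218F},
    {0x2460, 0x24FF},
    {0x2776, 0x2793},
    {0x10000, 0x1FFFD},  // most pictographic emoji were later assigned here
}};

// True if the code point falls in one of the ranges listed above.
constexpr bool in_cxx11_ranges(char32_t c) {
    for (const Range& r : kAllowed) {
        if (c >= r.lo && c <= r.hi) return true;
    }
    return false;
}
```

Under that assumption, U+23F0, U+2620, U+270B, U+2708 and U+2639 fall into gaps between the listed ranges, while U+1F550, U+1F480, U+1F44A, U+1F680 and U+1F600 all land in 10000-1FFFD. That matches the "accident" described on the previous slide: the block was allowed wholesale before the emoji were assigned.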
>
> ♀ AND ♂ ARE DISALLOWED
>
> // Valid
> bool 👷 = true; // Construction Worker
> // Not valid
> bool 👷♀ = false; // Woman Construction Worker ({Construction
> // Worker}{ZWJ}{Female Sign})
>
> 10
>
> EMOJI ARE NOT "STABLE" IN UNICODE
>
> From the emoji spec
>
> isEmoji(♟)=false for Emoji Version 5.0, but true for Version 11.0.
>
> It is possible that the emoji property could be removed.
>
> 11
>
> SOME SURPRISING THINGS ARE EMOJI
>
> 002A ; Emoji # E0.0 [1] (*️) asterisk
> 0030..0039 ; Emoji # E0.0 [10] (0️..9️) digit
> zero..digit nine
>
> {DIGIT ONE}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} 1️⃣
>
> {ASTERISK}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} *️⃣
>
> 12
>
> FIXING THE EMOJI PROBLEM WOULD MEAN BEING INVENTIVE
>
> Being inventive in an area outside our expertise is HARD
>
> Adopting UAX31 as a base to move forward is conservative
>
> 13
>
> SCRIPT ISSUES
>
> Some scripts require characters to control display or require
> punctuation that are not in the identifier set.
>
> 14
>
> THIS INCLUDES ENGLISH
>
> Apostrophe and dash
>
> Won't, Can't, Mustn't
> Mother-in-law
>
> Programmers are used to this and do not notice
> 15
>
> ZWJ AND ZWNJ
>
> Zero width joiner and non joiners are used in some scripts
>
> Farsi word "names"
>
> نامهای
> NOON + ALEF + MEEM + HEH + ALEF + FARSI YEH
>
> Farsi word "a letter"
>
> نامهای
> NOON + ALEF + MEEM + HEH + ZWNJ + ALEF + FARSI YEH
>
> Anecdotally, these issues are understood and worked around
>
> 16
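Inline note: a small sketch of why this matters for identifiers, assuming identifiers are compared as plain sequences of code points. The two words on the slide differ only by U+200C, so a rule that forbids ZWNJ leaves the second word unspellable, and a tool that silently drops it turns the second word into the first. strip_joiners is a hypothetical helper, not part of any proposal.

```cpp
#include <algorithm>
#include <string>

// Remove U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER.
inline std::u32string strip_joiners(std::u32string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](char32_t c) { return c == 0x200C || c == 0x200D; }),
            s.end());
    return s;
}

// NOON + ALEF + MEEM + HEH + ALEF + FARSI YEH ("names")
const std::u32string names = U"\u0646\u0627\u0645\u0647\u0627\u06CC";
// NOON + ALEF + MEEM + HEH + ZWNJ + ALEF + FARSI YEH ("a letter")
const std::u32string a_letter = U"\u0646\u0627\u0645\u0647\u200C\u0627\u06CC";
```

Here names != a_letter holds, but strip_joiners(a_letter) == names, which is the collision a programmer has to work around by choosing a different spelling.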
>
> OTHER ADOPTERS
>
> Java https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html#jls-3.8
> Python 3 https://www.python.org/dev/peps/pep-3131/
> Erlang https://www.erlang.org/erlang-enhancement-proposals/eep-0040.html
> Rust https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html
> JS https://tc39.es/ecma262/
> 17
>
> WE HAVE WORDING
>
> Core change
>
> identifier:
>     identifier-start
>     identifier identifier-continue
>
> (Replaces identifier-nondigit in both productions and removes the
> production "identifier digit"; digit moves into identifier-continue.)
>
> identifier-start:
>     nondigit
>     universal-character-name of class XID_Start
>
> identifier-continue:
>     digit
>     nondigit
>     universal-character-name of class XID_Continue
> 18
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2020-09-23 12:45:16