sg16: Re: [SG16-Unicode] In response to NL029

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 2 Nov 2019 15:05:42 -0400

Thanks, Steve. Could you please attach this paper to the SG16 wiki at
http://wiki.edg.com/bin/view/Wg21belfast/SG16?

Tom.

On 11/2/19 9:44 AM, Steve Downey wrote:
>
>
> C++ Identifier Syntax using Unicode Standard Annex 31
>
> Document #: D19149R0
> Date: 2019-11-02
> Project: Programming Language C++
> SG16
> EWG
> CWG
> Reply-to: Steve Downey
> <sdowney_at_[hidden], sdowney2_at_[hidden]>
>
>
> 1 Abstract
>
> In response to NL 029 : Disallow zero-width and control characters
>
> Adopt Unicode Standard Annex 31 as part of C++23:
>
> * C++ identifiers match the pattern (XID_START + _ ) + XID_CONTINUE*.
> * Portable source is required to be normalized as NFC.
> * Using unassigned code points is ill-formed.
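
The three rules proposed in the abstract can be sketched in Python as a rough illustration. This is not proposed wording and not a compiler implementation. It assumes that Python's str.isidentifier(), which the Python documentation defines in terms of XID_Start, XID_Continue, and the underscore, is a close enough stand-in for the proposed pattern. The helper names are hypothetical, and results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def matches_uax31_pattern(name: str) -> bool:
    """Assumption: str.isidentifier() approximates (XID_Start + _) + XID_Continue*.

    Note that it also returns True for keywords such as "class".
    """
    return name.isidentifier()


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def has_unassigned(text: str) -> bool:
    """True if any code point has General_Category Cn.

    Cn covers unassigned code points and noncharacters in the
    Unicode version known to this interpreter.
    """
    return any(unicodedata.category(ch) == "Cn" for ch in text)


if __name__ == "__main__":
    for candidate in ["_tmp1", "caf\u00e9", "1abc", "a\u200db"]:
        print(repr(candidate), matches_uax31_pattern(candidate))
    print(is_nfc("cafe\u0301"))       # e followed by a combining acute accent
    print(has_unassigned("x\u0378"))  # U+0378 is assumed unassigned here
```

Splitting the checks into three functions mirrors the three bullets: the pattern applies to identifiers, while the NFC and unassigned-code-point checks could apply to source text more broadly.
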
>
>
> 2 Poll before discussion
>
> The current state, allowing control characters, ZWJ, and unassigned
> code points in C++ identifiers, is not a defect, is working as
> designed, and does not need to be addressed.
>
>
> 3 Addressing identifiers in a more principled way
>
> UNICODE IDENTIFIER AND PATTERN SYNTAX
> <https://unicode.org/reports/tr31/> is an attempt to provide a
> normative way of specifying definitions of general-purpose identifiers
> for use in programming languages. It has evolved significantly over the
> years, particularly since C++11 was specified. The characters that
> were allowed in identifiers, and the patterns, were not stable at the
> time of C++11, which is the last time identifiers were addressed in
> the standard. In addition, at that time, ISO was promulgating advice
> suggesting a list of code points as the recommended method for ISO
> standards to specify identifiers.
>
> Today the definitions in UAX31 can be used to provide stable
> definitions for programming language identifiers, with guarantees that
> an identifier will not be invalidated by later standards.
>
> Originally, UAX31 relied on derived properties of characters, ID_START
> and ID_CONTINUE; however, those properties relied on fundamental
> properties that could change over time. The Unicode database now
> provides XID_START and XID_CONTINUE, based on the same
> characteristics but with an additional stability guarantee, and
> classifies characters explicitly under both.
>
> The original definitions closely match the identifier syntax of C:
>
> *Properties*   *General Description of Coverage*
>
> ID_Start       ID_Start characters are derived from the Unicode
>                General_Category of uppercase letters, lowercase
>                letters, titlecase letters, modifier letters, other
>                letters, letter numbers, plus Other_ID_Start, minus
>                Pattern_Syntax and Pattern_White_Space code points.
>
>                In set notation:
>
>                [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
> ID_Continue    ID_Continue characters include ID_Start characters,
>                plus characters having the Unicode General_Category of
>                nonspacing marks, spacing combining marks, decimal
>                number, connector punctuation, plus Other_ID_Continue,
>                minus Pattern_Syntax and Pattern_White_Space code
>                points.
>
>                In set notation:
>
>                [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
>
>
> The XID versions of the properties start from the same definitions,
> but are guaranteed to remain stable in subsequent versions of the
> Unicode Standard.
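
To make the derivation in the table concrete, here is a minimal Python sketch that classifies characters by General_Category alone. It is only an approximation: Python's unicodedata module does not expose Other_ID_Start, Other_ID_Continue, Pattern_Syntax, or Pattern_White_Space, so those terms are left out, and the function names are hypothetical.

```python
import unicodedata

# General_Category values named in the ID_Start description:
# uppercase, lowercase, titlecase, modifier, and other letters,
# plus letter numbers.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Categories that ID_Continue adds on top of ID_Start: nonspacing marks,
# spacing combining marks, decimal numbers, connector punctuation.
CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start from General_Category only."""
    return unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue from General_Category only."""
    return (approx_id_start(ch)
            or unicodedata.category(ch) in CONTINUE_EXTRA_CATEGORIES)


if __name__ == "__main__":
    for ch in ["A", "9", "_", "\u0301", "\u200d", "+"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              approx_id_start(ch), approx_id_continue(ch))
```

The underscore falls out of this as a continue character (it is connector punctuation, Pc) but not a start character, which is why the proposed pattern adds it to the start set explicitly.
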
>
>
> 4 Issues
>
> * XID_CONTINUE does not include ZWJ, which some scripts require
> * Does not exclude homoglyph attacks
> * Does not require the compiler to normalize identifiers
> * Does not allow emoji
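
The four issues above can be observed with a short Python sketch. As an assumption, it again treats str.isidentifier() as a stand-in for the XID-based pattern. The sample strings are made up for illustration, and the behavior shown is that of Python's string functions, not of any C++ compiler.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC form of the text."""
    return unicodedata.normalize("NFC", text)


with_zwj = "a\u200db"           # ZERO WIDTH JOINER between two letters
latin = "paypal"                # all Latin letters
lookalike = "p\u0430yp\u0430l"  # CYRILLIC SMALL LETTER A in place of "a"
composed = "caf\u00e9"          # precomposed e with acute
decomposed = "cafe\u0301"       # e followed by COMBINING ACUTE ACCENT
emoji = "\U0001F600"            # GRINNING FACE

if __name__ == "__main__":
    # 1. ZWJ is not a continue character, so the name is rejected.
    print("ZWJ accepted:", with_zwj.isidentifier())
    # 2. The lookalike is a valid but distinct identifier, and
    #    normalization does not merge it with the Latin spelling.
    print("lookalike accepted:", lookalike.isidentifier(),
          "equal to latin:", nfc(lookalike) == nfc(latin))
    # 3. Both spellings are accepted; they compare equal only after
    #    something normalizes them.
    print("raw equal:", composed == decomposed,
          "NFC equal:", nfc(composed) == nfc(decomposed))
    # 4. Emoji are symbols, not letters, so they are rejected.
    print("emoji accepted:", emoji.isidentifier())
```
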
>
>
> 5 History
>
> Using an explicit list of Unicode characters was considered a best
> practice for ISO standardization in TR 10176:2003 Guidelines for the
> preparation of programming language standards.
>
> National body comment CA 24 for C++11:
>
> A list of issues related to TR 10176:2003:
>
> * “Combining characters should not appear as the first character
> of an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A)
> This is not reflected in FCD.
> * Restrictions on the first character of an identifier are not
> observed as recommended in TR 10176:2003. The inclusion of
> digits (outside of those in the basic character set) under
> identifier-nondigit is implied by FCD.
> * It is implied that only the “main listing” from Annex A is
> included for C++. That is, the list ends with the Special
> Characters section. This is not made explicit in FCD. Existing
> practice in C++03 as well as WG 14 (C, as of N1425) and WG 4
> (COBOL, as of N4315) is to include a list in a normative Annex.
> * Specify width sensitivity as implied by C++03: \uFF21 is not the
> same as A. Case sensitivity is already stated in [lex.name].
>
> N3146 (2010-10-04) considered using UAX31, but at the time there were
> stability issues with identifiers, and it came down on the side of
> explicit whitelisting.
>
> The Unicode standard has since made stability guarantees about
> identifiers, and created the XID_START and XID_CONTINUE properties to
> alleviate the stability concerns that existed in 2010.
>
>
> 6 Wording
>
> Wording to follow based on SG16 and EWG guidance. There is much prior
> art to follow based on similar proposals and adoption in Rust and Swift.
>
> Explicit universal character names and code points are available for
> particular versions of the Unicode Standard from the published
> database, and could be included as an appendix.
>
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-11-02 20:05:47