sg16: Re: [SG16-Unicode] In response to NL029

From: Steve Downey <sdowney_at_[hidden]>
Date: Sat, 2 Nov 2019 19:22:37 -0400

Will do.

On Sat, Nov 2, 2019, 15:07 Tom Honermann <tom_at_[hidden]> wrote:

> Also, please clarify the document number. I suspect it should be D1949R0
> (it looks like an extra "1" may have snuck in there).
>
> Tom.
>
> On 11/2/19 3:05 PM, Tom Honermann wrote:
>
> Thanks, Steve. Could you please attach this paper to the SG16 wiki at
> http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>
> Tom.
>
> On 11/2/19 9:44 AM, Steve Downey wrote:
>
> C++ Identifier Syntax using Unicode Standard Annex 31
> Document #: D19149R0
> Date: 2019-11-02
> Project: Programming Language C++
> SG16
> EWG
> CWG
> Reply-to: Steve Downey
> <sdowney_at_[hidden], sdowney2_at_[hidden]>
> 1 Abstract
>
> In response to NL 029 : Disallow zero-width and control characters
>
> Adopt Unicode Annex 31 as part of C++23:
>
> - That C++ identifiers match the pattern (XID_START + _ ) + XID_CONTINUE*.
> - That portable source is required to be normalized as NFC.
> - That using unassigned code points is ill-formed.
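
The three requirements in the abstract can be sketched in a few lines. This is a minimal illustration, not proposed wording. It uses Python rather than C++ only because Python's standard library bundles the Unicode Character Database. It assumes that `str.isidentifier()` applies the pattern XID_Start (plus `_`) followed by XID_Continue*, as Python's language reference describes, using whatever Unicode version the interpreter ships with. The helper name `is_portable_identifier` is hypothetical.

```python
import unicodedata


def is_portable_identifier(name: str) -> bool:
    """Hypothetical check mirroring the three points of the abstract."""
    # 1. Pattern: (XID_Start + _) followed by XID_Continue*.
    #    Assumption: str.isidentifier() applies this rule.
    if not name.isidentifier():
        return False
    # 2. Portable source must already be in Normalization Form C.
    if unicodedata.normalize("NFC", name) != name:
        return False
    # 3. No unassigned code points (General_Category Cn). Step 1 should
    #    already reject these; the check is kept explicit for clarity.
    return all(unicodedata.category(ch) != "Cn" for ch in name)


if __name__ == "__main__":
    candidates = ["_count", "na\u00efve", "nai\u0308ve", "2fast", "a\u0378"]
    for candidate in candidates:
        print(ascii(candidate), is_portable_identifier(candidate))
```

The decomposed spelling `nai\u0308ve` matches the pattern but is rejected because it is not in NFC, which is the distinction the second point draws.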
> 2 Poll before discussion
>
> The current state, allowing control characters, ZWJ, and unassigned
> code points in C++ identifiers, is not a defect, is working as designed,
> and does not need to be addressed.
> 3 Addressing identifiers in a more principled way
>
> UNICODE IDENTIFIER AND PATTERN SYNTAX <https://unicode.org/reports/tr31/>
> (UAX31) is an attempt to provide a normative way of specifying definitions
> of general-purpose identifiers for use in programming languages. It has
> evolved significantly over the years, in particular since C++11 was
> specified. The characters that were allowed in identifiers, and the
> patterns, were not stable at the time of C++11, which is the last time
> identifiers were addressed in the standard. In addition, at that time, ISO
> was promulgating advice suggesting a list of code points as the
> recommended method for ISO standards to specify identifiers.
>
> Today the definitions in UAX31 can be used to provide stable definitions
> for programming language identifiers, with guarantees that an identifier
> will not be invalidated by later standards.
>
> Originally, UAX31 relied on the derived character properties ID_START and
> ID_CONTINUE; however, those properties relied on fundamental properties that
> could change over time. The Unicode database now provides XID_START and
> XID_CONTINUE, based on the same characteristics, but with an additional
> stability guarantee. The Unicode database now provides explicit
> classification of both.
>
> The original definitions closely match the identifier syntax of C:
>
> *Property*: ID_Start
> *General Description of Coverage*: ID_Start characters are derived from the
> Unicode General_Category of uppercase letters, lowercase letters, titlecase
> letters, modifier letters, other letters, letter numbers, plus
> Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.
>
> In set notation:
>
> [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
> *Property*: ID_Continue
> *General Description of Coverage*: ID_Continue characters include ID_Start
> characters, plus characters having the Unicode General_Category of
> nonspacing marks, spacing combining marks, decimal number, connector
> punctuation, plus Other_ID_Continue, minus Pattern_Syntax and
> Pattern_White_Space code points.
>
> In set notation:
>
> [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
>
> The X versions of the properties start from the same definitions, but are
> guaranteed to remain stable in subsequent versions of the Unicode Standard.
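
The descriptions above map fairly directly onto General_Category lookups. The following Python sketch (Python is used only because its standard library exposes `unicodedata.category`) approximates the two derivations. It is deliberately incomplete: the Other_ID_Start and Other_ID_Continue additions and the Pattern_Syntax and Pattern_White_Space subtractions are omitted, because the standard library does not expose those properties. The function names are hypothetical.

```python
import unicodedata

# General_Category values named in the ID_Start description: uppercase,
# lowercase, titlecase, modifier and other letters, plus letter numbers.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Extra categories named in the ID_Continue description: nonspacing marks,
# spacing combining marks, decimal numbers, connector punctuation.
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start; ignores Other_ID_Start and the Pattern_* sets."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue; ignores Other_ID_Continue and Pattern_* sets."""
    return approx_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


if __name__ == "__main__":
    for ch in ["A", "\u2167", "7", "_", "\u0301", "\u200d"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              approx_id_start(ch), approx_id_continue(ch))
```

Under these categories `_` is connector punctuation, so it is a continue character but not a start character, which is why the proposed pattern adds `_` to the start set explicitly. ZWJ (U+200D) has General_Category Cf and is therefore not covered by any of the categories listed.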
> 4 Issues
>
> - XID_Continue does not include ZWJ, which some scripts require
> - Does not exclude homoglyph attacks
> - Does not require the compiler to normalize identifiers
> - Does not allow emoji
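
Three of the issues listed above can be shown concretely. In this Python sketch (again chosen only for its bundled Unicode data, with `str.isidentifier()` assumed to follow the XID_Start / XID_Continue pattern), a Latin and a Cyrillic spelling of what renders as the same name are both accepted and stay distinct, two spellings of an accented name compare equal only after explicit normalization, and an emoji is rejected. The variable names are illustrative only.

```python
import unicodedata

latin = "scope"          # all Latin letters
cyrillic = "s\u0441ope"  # U+0441 CYRILLIC SMALL LETTER ES replaces "c"

# Homoglyphs: both spellings pass the identifier pattern and remain
# different identifiers even after NFC normalization.
both_valid = latin.isidentifier() and cyrillic.isidentifier()
still_distinct = (unicodedata.normalize("NFC", latin)
                  != unicodedata.normalize("NFC", cyrillic))

# Normalization: precomposed and decomposed spellings differ as code point
# sequences until something normalizes them.
precomposed = "caf\u00e9"  # ends in U+00E9
decomposed = "cafe\u0301"  # ends in "e" + U+0301 COMBINING ACUTE ACCENT
equal_raw = precomposed == decomposed
equal_nfc = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Emoji: U+1F600 has General_Category So (other symbol), so it is neither
# a start nor a continue character.
emoji_ok = "\U0001F600".isidentifier()

if __name__ == "__main__":
    print(both_valid, still_distinct, equal_raw, equal_nfc, emoji_ok)
```

NFC addresses only canonically equivalent spellings. It does nothing for look-alike characters from different scripts, so the homoglyph issue needs a separate mechanism.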
>
> 5 History
>
> Using an explicit list of Unicode characters was considered a best
> practice for ISO standardization in TR 10176:2003 Guidelines for the
> preparation of programming language standards.
>
> National body comment CA 24 for C++11:
>
> A list of issues related to TR 10176:2003:
>
> - “Combining characters should not appear as the first character of an
> identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A) This is not
> reflected in FCD.
> - Restrictions on the first character of an identifier are not
> observed as recommended in TR 10176:2003. The inclusion of digits (outside
> of those in the basic character set) under identifier-nondigit is implied by
> FCD.
> - It is implied that only the “main listing” from Annex A is included
> for C++. That is, the list ends with the Special Characters section. This
> is not made explicit in FCD. Existing practice in C++03 as well as WG 14
> (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include a list in a
> normative Annex.
> - Specify width sensitivity as implied by C++03: \uFF21 is not the same as A.
> Case sensitivity is already stated in [lex.name].
>
> N3146 (2010-10-04) considered using UAX31, but at the time there were
> stability issues with identifiers, and the paper came down on the side of
> explicit whitelisting.
>
> The Unicode standard has since made stability guarantees about
> identifiers, and created the XID_START and XID_CONTINUE properties to
> alleviate the stability concerns that existed in 2010.
> 6 Wording
>
> Wording to follow based on SG16 and EWG guidance. There is much prior art
> to follow based on similar proposals and adoption in Rust and Swift.
>
> Explicit universal character names and code points are available for
> particular versions of the Unicode Standard from the published database, and
> could be included as an appendix.
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]://www.open-std.org/mailman/listinfo/unicode

Received on 2019-11-03 00:22:51