sg16: Re: [SG16-Unicode] In response to NL029

From: Yehezkel Bernat <yehezkelshb_at_[hidden]>
Date: Sun, 3 Nov 2019 09:39:20 +0200

I'm sorry if this isn't the right place/thread to ask it:
Why do we allow non-ASCII characters in identifiers at all? Wouldn't life
be simpler if identifiers must include only ASCII alphanumeric characters?
I know I assumed it to be the case until lately (when I started reading the
relevant papers here.)

Or maybe Unicode was allowed in the past and now it's too late to change it?

On Sun, Nov 3, 2019 at 1:22 AM Steve Downey <sdowney_at_[hidden]> wrote:

> Will do.
>
> On Sat, Nov 2, 2019, 15:07 Tom Honermann <tom_at_[hidden]> wrote:
>
>> Also, please clarify the document number. I suspect it should be D1949R0
>> (it looks like an extra "1" may have snuck in there).
>>
>> Tom.
>>
>> On 11/2/19 3:05 PM, Tom Honermann wrote:
>>
>> Thanks, Steve. Could you please attach this paper to the SG16 wiki at
>> http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>>
>> Tom.
>>
>> On 11/2/19 9:44 AM, Steve Downey wrote:
>>
>> C++ Identifier Syntax using Unicode Standard Annex 31
>> Document #: D19149R0
>> Date: 2019-11-02
>> Project: Programming Language C++
>> SG16
>> EWG
>> CWG
>> Reply-to: Steve Downey
>> <sdowney_at_[hidden], sdowney2_at_[hidden]>
>> 1 Abstract
>>
>> In response to NL 029 : Disallow zero-width and control characters
>>
>> Adopt Unicode Annex 31 as part of C++ 23. - That C++ identifiers match
>> the pattern (XID_START + _ ) + XID_CONTINUE*. - That portable source is
>> required to be normalized as NFC. - That using unassigned code points
>> ill-formed.
>> 2 Poll before discussion
>>
>> The current state, allowing control characters, ZWJ, and unassigned
>> codepoints in C++ identifiers is not a defect, and is working as designed,
>> and does not need to be addressed
>> 3 Addressing identifiers in a more principled ways
>>
>> UNICODE IDENTIFIER AND PATTERN SYNTAX <https://unicode.org/reports/tr31/> is
>> an attempt to provide a normative way of specifying definitions of
>> general-purpose identifiers for use in programming languages. It has
>> evolved signfigantly over the years, in particular since the time that C++
>> 11 was specified. In particular, the characters that were allowed as
>> identifiers, and the patterns, were not stable at the time of C++11, which
>> is the last time identifiers were addressed in the standard. In addition,
>> at that time, ISO was promulgating advice suggesting a list of code points
>> as the recommended method for ISO standards to specify identifiers.
>>
>> Today the definitions in UAX31 can be used to provide stable definitions
>> for programming language identifiers, with guarantees that an identifier
>> will not be invalidated by later standards.
>>
>> Originally, UAX31 relied on derived properties of characters, ID_START
>> and ID_CONTINUE, however those properties relied on fundamental properties
>> that could change over time. The unicode database now provides XID_START
>> and XID_CONTINUE, based on the same characteristics, but with an additional
>> stability guarantee. The Unicode database now provides explicit
>> classification of both.
>>
>> The original definitions closely match the identifier syntax of C:
>> *Properties*
>> *General Description of Coverage*
>> ID_Start ID_Start characters are derived from the Unicode
>> General_Category of uppercase letters, lowercase letters, titlecase
>> letters, modifier letters, other letters, letter numbers, plus
>> Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.
>>
>> In set notation:
>>
>> [\p{L}\p{Nl}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>> ID_Continue ID_Continue characters include ID_Start characters, plus
>> characters having the Unicode General_Category of nonspacing marks, spacing
>> combining marks, decimal number, connector punctuation, plus
>> Other_ID_Continue , minus Pattern_Syntax and Pattern_White_Space code
>> points.
>>
>> In set notation:
>>
>>
>> [\p{ID_Start}\p{Mc}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>
>>
>> The X versions of the properties start the same, but are guaranteed
>> stable in subsequent Unicode standards
>> 4 Issues
>>
>> - Continue does not include ZWJ, which some scripts require
>> - Does not exclude homoglyph attack
>> - Does not require the compiler to normalize identifiers
>> - Does not allow emoji
>>
>> 5 History
>>
>> Using an explicit list of Unicode characters was considered a best
>> practice for ISO standardization in TR 10176:2003 Guidelines for the
>> preparation of programming language standards.
>>
>> National body comment CA 24 for C++11:
>>
>> A list of issues related TR 10176:2003:
>>
>> - “Combining characters should not appear as the first character of
>> an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A) This is not
>> reflected in FCD.
>> - Restrictions on the first character of an identifier are not
>> observed as recommended in TR 10176:2003. The inclusion of digits (outside
>> of those in the basic character set) under identifer-nondigit is implied by
>> FCD.
>> - It is implied that only the “main listing” from Annex A is included
>> for C++. That is, the list ends with the Special Characters section. This
>> is not made explicit in FCD. Existing practice in C++03 as well as WG 14
>> (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include a list in a
>> normative Annex.
>> - Specify width sensitivity as implied by C++03: is not the same as
>> A. Case sensitivity is already stated in [lex.name].
>>
>> N3146 in 2010-10-04 considered using UAX31, but at the time there were
>> stability issues with identifiers, and came down on the side of explicit
>> white listing.
>>
>> The Unicode standard has since made stability guarantees about
>> identifiers, and created the XID_START and XID_CONTINUE properties to
>> alleviate the stability concerns that existed in 2010.
>> 6 Wording
>>
>> Wording to follow based on SG16 and EWG guidance. There is much prior art
>> to follow based on similar proposals and adoption in Rust and Swift.
>>
>> Explicit universal character names and codepoints are available for
>> particular Unicode standards from the published database, and could be
>> appended as an appendix.
>>
>> _______________________________________________
>> SG16 Unicode mailing listUnicode_at_[hidden]://www.open-std.org/mailman/listinfo/unicode
>>
>>
>>
>> _______________________________________________
>> SG16 Unicode mailing listUnicode_at_[hidden]://www.open-std.org/mailman/listinfo/unicode
>>
>>
>> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-11-03 08:39:38