Re: [SG16-Unicode] In response to NL029

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 2 Nov 2019 15:07:39 -0400
Also, please clarify the document number. I suspect it should be
D1949R0 (it looks like an extra "1" may have snuck in there).

Tom.

On 11/2/19 3:05 PM, Tom Honermann wrote:
> Thanks, Steve. Could you please attach this paper to the SG16 wiki at
> http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>
> Tom.
>
> On 11/2/19 9:44 AM, Steve Downey wrote:
>>
>>
>> C++ Identifier Syntax using Unicode Standard Annex 31
>>
>> Document #: D19149R0
>> Date: 2019-11-02
>> Project: Programming Language C++
>> SG16
>> EWG
>> CWG
>> Reply-to: Steve Downey
>> <sdowney_at_[hidden], sdowney2_at_[hidden]>
>>
>>
>> 1 Abstract
>>
>> In response to NL 029 : Disallow zero-width and control characters
>>
>> Adopt Unicode Annex 31 as part of C++ 23:
>>
>> * That C++ identifiers match the pattern
>> (XID_START + _ ) + XID_CONTINUE*.
>> * That portable source is required to be normalized as NFC.
>> * That using unassigned code points be ill-formed.
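
The pattern and NFC requirement in the abstract can be tried out with a rough sketch. This is Python, not the proposed C++ wording. It assumes that CPython's str.isidentifier(), which tests the first character against XID_Start or "_" and the rest against XID_Continue, is an acceptable stand-in for the pattern. The function names are hypothetical and not part of the paper.

```python
import unicodedata


def matches_identifier_pattern(name: str) -> bool:
    # Assumption: str.isidentifier() stands in for
    # (XID_START + _) + XID_CONTINUE*. It does not normalize its input,
    # and it also accepts Python keywords such as "class".
    return name.isidentifier()


def is_nfc(name: str) -> bool:
    # A string is in NFC if normalizing it to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", name) == name


def is_portable_identifier(name: str) -> bool:
    # Hypothetical combination of the pattern and NFC bullets from the abstract.
    return matches_identifier_pattern(name) and is_nfc(name)


if __name__ == "__main__":
    samples = ["_x1", "1abc", "caf\u00e9", "cafe\u0301", "a\u200db", "\u0378"]
    for s in samples:
        print(ascii(s), matches_identifier_pattern(s), is_nfc(s))
```

The decomposed spelling "cafe\u0301" matches the pattern but is not NFC, while "a\u200db" (containing ZWJ) and the unassigned code point U+0378 fail the pattern itself.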
>>
>>
>> 2 Poll before discussion
>>
>> The current state, allowing control characters, ZWJ, and unassigned
>> code points in C++ identifiers, is not a defect, is working as
>> designed, and does not need to be addressed.
>>
>>
>> 3 Addressing identifiers in a more principled way
>>
>> UNICODE IDENTIFIER AND PATTERN SYNTAX
>> <https://unicode.org/reports/tr31/> is an attempt to provide a
>> normative way of specifying definitions of general-purpose
>> identifiers for use in programming languages. It has evolved
>> significantly over the years, in particular since the time that C++11
>> was specified. The characters that were allowed in identifiers, and
>> the patterns, were not stable at the time of C++11, which is the last
>> time identifiers were addressed in the standard. In addition, at that
>> time, ISO was promulgating advice suggesting a list of code points as
>> the recommended method for ISO standards to specify identifiers.
>>
>> Today the definitions in UAX31 can be used to provide stable
>> definitions for programming language identifiers, with guarantees
>> that an identifier will not be invalidated by later standards.
>>
>> Originally, UAX31 relied on the derived character properties
>> ID_START and ID_CONTINUE; however, those properties relied on
>> fundamental properties that could change over time. The Unicode
>> database now provides XID_START and XID_CONTINUE, based on the same
>> characteristics, but with an additional stability guarantee. The
>> Unicode database now provides explicit classification of both.
>>
>> The original definitions closely match the identifier syntax of C:
>>
>> Property: ID_Start
>>
>> General description of coverage: ID_Start characters are derived
>> from the Unicode General_Category of uppercase letters, lowercase
>> letters, titlecase letters, modifier letters, other letters, letter
>> numbers, plus Other_ID_Start, minus Pattern_Syntax and
>> Pattern_White_Space code points.
>>
>> In set notation:
>>
>> [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>
>> Property: ID_Continue
>>
>> General description of coverage: ID_Continue characters include
>> ID_Start characters, plus characters having the Unicode
>> General_Category of nonspacing marks, spacing combining marks,
>> decimal number, connector punctuation, plus Other_ID_Continue, minus
>> Pattern_Syntax and Pattern_White_Space code points.
>>
>> In set notation:
>>
>> [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
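
The General_Category part of the two descriptions above can be approximated in a few lines of Python. This is only a sketch: the standard unicodedata module does not expose Other_ID_Start, Other_ID_Continue, Pattern_Syntax, or Pattern_White_Space, so those terms are left out, and the function names are hypothetical.

```python
import unicodedata

# Categories named in the ID_Start description: \p{L} and \p{Nl}.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Categories that ID_Continue adds: Mn, Mc, Nd, Pc.
ID_CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    # Approximation: ignores Other_ID_Start and the Pattern_* subtractions.
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    # Approximation: ignores Other_ID_Continue and the Pattern_* subtractions.
    return (approx_id_start(ch)
            or unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES)


if __name__ == "__main__":
    for ch in ["A", "\u2160", "1", "_", "\u0301", "\u200d", "-"]:
        print(ascii(ch), unicodedata.category(ch),
              approx_id_start(ch), approx_id_continue(ch))
```

Under this approximation "_" is a continue character but not a start character, which is why the pattern in the abstract adds it to the start set explicitly. ZWJ (U+200D, category Cf) falls in neither set.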
>>
>>
>>
>> The X versions of the properties start from the same definitions,
>> but are guaranteed stable in subsequent Unicode standards.
>>
>>
>> 4 Issues
>>
>> * Continue does not include ZWJ, which some scripts require
>> * Does not exclude homoglyph attacks
>> * Does not require the compiler to normalize identifiers
>> * Does not allow emoji
>>
>>
>> 5 History
>>
>> Using an explicit list of Unicode characters was considered a best
>> practice for ISO standardization in TR 10176:2003 Guidelines for the
>> preparation of programming language standards.
>>
>> National body comment CA 24 for C++11:
>>
>> A list of issues related to TR 10176:2003:
>>
>> * “Combining characters should not appear as the first
>> character of an identifier.” Reference: ISO/IEC TR 10176:2003
>> (Annex A) This is not reflected in FCD.
>> * Restrictions on the first character of an identifier are not
>> observed as recommended in TR 10176:2003. The inclusion of
>> digits (outside of those in the basic character set) under
>> identifier-nondigit is implied by FCD.
>> * It is implied that only the “main listing” from Annex A is
>> included for C++. That is, the list ends with the Special
>> Characters section. This is not made explicit in FCD.
>> Existing practice in C++03 as well as WG 14 (C, as of N1425)
>> and WG 4 (COBOL, as of N4315) is to include a list in a
>> normative Annex.
>> * Specify width sensitivity as implied by C++03: is not the
>> same as A. Case sensitivity is already stated in [lex.name].
>>
>> N3146 (2010-10-04) considered using UAX31, but at the time there
>> were stability issues with identifiers, and it came down on the side
>> of an explicit whitelist.
>>
>> The Unicode standard has since made stability guarantees about
>> identifiers, and created the XID_START and XID_CONTINUE properties to
>> alleviate the stability concerns that existed in 2010.
>>
>>
>> 6 Wording
>>
>> Wording to follow based on SG16 and EWG guidance. There is much prior
>> art to follow based on similar proposals and adoption in Rust and Swift.
>>
>> Explicit universal character names and codepoints are available for
>> particular Unicode standards from the published database, and could
>> be appended as an appendix.
>>
>>
>> _______________________________________________
>> SG16 Unicode mailing list
>> Unicode_at_[hidden]
>> http://www.open-std.org/mailman/listinfo/unicode



Received on 2019-11-02 20:07:48