sg16: Re: [SG16-Unicode] In response to NL029

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 3 Nov 2019 09:32:21 +0100

On Sun, Nov 3, 2019, 08:39 Yehezkel Bernat <yehezkelshb_at_[hidden]> wrote:

> I'm sorry if this isn't the right place/thread to ask it:
> Why do we allow non-ASCII characters in identifiers at all? Wouldn't life
> be simpler if identifiers must include only ASCII alphanumeric characters?
> I know I assumed it to be the case until lately (when I started reading
> the relevant papers here.)
>
> Or maybe Unicode was allowed in the past and now it's too late to change
> it?
>

I think implementers do support it, or want to support it.
But they don't necessarily do it right, and definitely not consistently, so I
personally think it's better to specify how to do it to ensure portability
and interoperability with other features such as reflection.

I do think using that feature needs to be done carefully but there are
certainly use cases for it.

>
> On Sun, Nov 3, 2019 at 1:22 AM Steve Downey <sdowney_at_[hidden]> wrote:
>
>> Will do.
>>
>> On Sat, Nov 2, 2019, 15:07 Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> Also, please clarify the document number. I suspect it should be
>>> D1949R0 (it looks like an extra "1" may have snuck in there).
>>>
>>> Tom.
>>>
>>> On 11/2/19 3:05 PM, Tom Honermann wrote:
>>>
>>> Thanks, Steve. Could you please attach this paper to the SG16 wiki at
>>> http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>>>
>>> Tom.
>>>
>>> On 11/2/19 9:44 AM, Steve Downey wrote:
>>>
>>> C++ Identifier Syntax using Unicode Standard Annex 31
>>> Document #: D19149R0
>>> Date: 2019-11-02
>>> Project: Programming Language C++
>>> SG16
>>> EWG
>>> CWG
>>> Reply-to: Steve Downey
>>> <sdowney_at_[hidden], sdowney2_at_[hidden]>
>>> 1 Abstract
>>>
>>> In response to NL 029 : Disallow zero-width and control characters
>>>
>>> Adopt Unicode Annex 31 as part of C++ 23:
>>>
>>> - That C++ identifiers match the pattern (XID_START + _ ) + XID_CONTINUE*.
>>> - That portable source is required to be normalized as NFC.
>>> - That using unassigned code points is ill-formed.
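
A rough sketch of how the three rules in the abstract could be checked for a single identifier. It is written in Python only because the standard library exposes the needed Unicode data. The function name is_proposed_identifier is made up for this illustration, and the sketch assumes that str.isidentifier() tests "XID_Start or _, then XID_Continue*" (which, as far as I know, is what CPython does). Results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_proposed_identifier(name: str) -> bool:
    """Approximate the three rules from the abstract (hypothetical helper)."""
    # Rule 1: (XID_Start + _) followed by XID_Continue*.
    # Assumption: str.isidentifier() implements this pattern.
    if not name.isidentifier():
        return False
    # Rule 2: the spelling must already be in Normalization Form C.
    if unicodedata.normalize("NFC", name) != name:
        return False
    # Rule 3: no unassigned code points (General_Category Cn).
    # Likely redundant after rule 1, kept to mirror the list above.
    return all(unicodedata.category(ch) != "Cn" for ch in name)


if __name__ == "__main__":
    # "e\u0301t\u00e9" mixes a decomposed and a precomposed accent (not NFC);
    # "a\u0378" contains a code point assumed to be unassigned.
    for candidate in ["_tmp", "na\u00efve", "1x", "e\u0301t\u00e9", "a\u0378"]:
        print(repr(candidate), is_proposed_identifier(candidate))
```

Note that the check rejects a non-NFC spelling rather than normalizing it, which matches the abstract's requirement on portable source and leaves open what a compiler should do with other input.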
>>> 2 Poll before discussion
>>>
>>> The current state, allowing control characters, ZWJ, and unassigned
>>> codepoints in C++ identifiers, is not a defect, is working as designed,
>>> and does not need to be addressed.
>>> 3 Addressing identifiers in a more principled way
>>>
>>> UNICODE IDENTIFIER AND PATTERN SYNTAX
>>> <https://unicode.org/reports/tr31/> is an attempt to provide a
>>> normative way of specifying definitions of general-purpose identifiers for
>>> use in programming languages. It has evolved significantly over the years,
>>> especially since the time that C++11 was specified. In particular, the
>>> characters that were allowed in identifiers, and the patterns, were not
>>> stable at the time of C++11, which is the last time identifiers were
>>> addressed in the standard. In addition, at that time, ISO was promulgating
>>> advice suggesting a list of code points as the recommended method for ISO
>>> standards to specify identifiers.
>>>
>>> Today the definitions in UAX31 can be used to provide stable definitions
>>> for programming language identifiers, with guarantees that an identifier
>>> will not be invalidated by later standards.
>>>
>>> Originally, UAX31 relied on the derived character properties ID_START
>>> and ID_CONTINUE. However, those properties were derived from fundamental
>>> properties that could change over time. The Unicode database now provides
>>> XID_START and XID_CONTINUE, based on the same characteristics but with an
>>> additional stability guarantee, and classifies characters under both
>>> explicitly.
>>>
>>> The original definitions closely match the identifier syntax of C:
>>>
>>> *Properties* and *General Description of Coverage*
>>>
>>> ID_Start: ID_Start characters are derived from the Unicode
>>> General_Category of uppercase letters, lowercase letters, titlecase
>>> letters, modifier letters, other letters, letter numbers, plus
>>> Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.
>>>
>>> In set notation:
>>>
>>> [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>> ID_Continue: ID_Continue characters include ID_Start characters, plus
>>> characters having the Unicode General_Category of nonspacing marks, spacing
>>> combining marks, decimal number, connector punctuation, plus
>>> Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code
>>> points.
>>>
>>> In set notation:
>>>
>>> [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>>
>>> The X versions of the properties start from the same definitions, but
>>> are guaranteed stable in subsequent Unicode standards.
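
To make the two descriptions above concrete, here is a simplified sketch that rebuilds the General_Category part of ID_Start and ID_Continue with Python's unicodedata module. It is an approximation under stated assumptions: it ignores Other_ID_Start and Other_ID_Continue and does not subtract Pattern_Syntax or Pattern_White_Space, since the standard library does not expose those properties. The names approx_id_start and approx_id_continue are hypothetical.

```python
import unicodedata

# General_Category values named in the ID_Start description:
# Lu, Ll, Lt, Lm, Lo (together \p{L}) plus Nl (letter numbers).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# ID_Continue adds nonspacing marks (Mn), spacing combining marks (Mc),
# decimal numbers (Nd) and connector punctuation (Pc).
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Category-only approximation of ID_Start for one character."""
    return unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Category-only approximation of ID_Continue for one character."""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


if __name__ == "__main__":
    # Sample characters: a letter, a digit, underscore, a combining accent,
    # a hyphen, and a Roman numeral (a letter number).
    for ch in ["a", "1", "_", "\u0301", "-", "\u2167"]:
        print(f"U+{ord(ch):04X}", approx_id_start(ch), approx_id_continue(ch))
```

The underscore falls out as a continue-only character here (it is connector punctuation), which is why the proposed pattern has to add it to the start set explicitly.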
>>> 4 Issues
>>>
>>> - XID_CONTINUE does not include ZWJ, which some scripts require
>>> - Does not exclude homoglyph attacks
>>> - Does not require the compiler to normalize identifiers
>>> - Does not allow emoji
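
A small sketch of the normalization and homoglyph items in the list above, again in Python and again only illustrative. It shows that two spellings which render alike can differ as code point sequences, that NFC reconciles the composed/decomposed case, and that NFC does nothing for look-alike letters from different scripts. The helper same_after_nfc is a hypothetical name.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """True if both spellings map to the same NFC code point sequence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"     # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

latin_a = "a"              # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"      # U+0430 CYRILLIC SMALL LETTER A, a homoglyph

if __name__ == "__main__":
    # Distinct as written, identical once normalized.
    print(composed == decomposed, same_after_nfc(composed, decomposed))
    # Both are valid identifier characters, and NFC keeps them distinct.
    print(latin_a.isidentifier(), cyrillic_a.isidentifier(),
          same_after_nfc(latin_a, cyrillic_a))
    # An emoji is not accepted as an identifier by this pattern.
    print("\U0001F600".isidentifier())
```

Without a normalization requirement, the first pair would name two different entities in a translation unit even though a reader cannot tell them apart.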
>>>
>>> 5 History
>>>
>>> Using an explicit list of Unicode characters was considered a best
>>> practice for ISO standardization in TR 10176:2003 Guidelines for the
>>> preparation of programming language standards.
>>>
>>> National body comment CA 24 for C++11:
>>>
>>> A list of issues related to TR 10176:2003:
>>>
>>> - “Combining characters should not appear as the first character of
>>> an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A) This is not
>>> reflected in FCD.
>>> - Restrictions on the first character of an identifier are not
>>> observed as recommended in TR 10176:2003. The inclusion of digits (outside
>>> of those in the basic character set) under identifier-nondigit is implied by
>>> FCD.
>>> - It is implied that only the “main listing” from Annex A is
>>> included for C++. That is, the list ends with the Special Characters
>>> section. This is not made explicit in FCD. Existing practice in C++03 as
>>> well as WG 14 (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include
>>> a list in a normative Annex.
>>> - Specify width sensitivity as implied by C++03: \uFF21 is not the same as
>>> A. Case sensitivity is already stated in [lex.name].
>>>
>>> N3146 (2010-10-04) considered using UAX31, but at the time there were
>>> stability issues with identifiers, and it came down on the side of explicit
>>> whitelisting.
>>>
>>> The Unicode standard has since made stability guarantees about
>>> identifiers, and created the XID_START and XID_CONTINUE properties to
>>> alleviate the stability concerns that existed in 2010.
>>> 6 Wording
>>>
>>> Wording to follow based on SG16 and EWG guidance. There is much prior
>>> art to follow based on similar proposals and adoption in Rust and Swift.
>>>
>>> Explicit universal character names and codepoints are available for
>>> particular Unicode standards from the published database, and could be
>>> included as an appendix.
>>>
>>> _______________________________________________
>>> SG16 Unicode mailing list
>>> Unicode_at_[hidden]
>>> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-11-03 09:32:42