On Mon, Oct 28, 2019 at 1:26 PM Mathias Stearn <redbeard0531+isocpp@gmail.com> wrote:

On Mon, Oct 28, 2019 at 12:58 PM Richard Smith <richardsmith@google.com> wrote:
On Mon, Oct 28, 2019 at 9:39 AM Mathias Stearn via Core <core@lists.isocpp.org> wrote:
Is it just uppercase letters in the basic source character set, or anything considered an uppercase letter in the universal character set after phase 1 transcoding and universal-character-name resolution? Or is there some other definition of uppercase?

My interpretation:

* We don't resolve universal-character-names; rather, we *form* them. (E.g., int façade; is converted into int fa\u00e7ade;) So, for example, _Ç becomes _\u00c7, which doesn't start with an underscore followed by an uppercase letter (it's an underscore followed by a backslash).

I considered that but it felt like an overly legalistic reading at the time. It also seems to be counter to http://eel.is/c++draft/lex.name#1. On the other hand, that first sentence "An identifier is an arbitrarily long sequence of letters and digits." is clearly incorrect because many of the allowed code points (including all emoji) are neither letters nor digits.

It also seems vaguely counter to my reading of the "spirit" of http://eel.is/c++draft/lex.phases#1.1.sentence-4, but I have no idea what the normative impact of that sentence is. (I hope compilers' internal encoding choices are not observable...)
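
To make the "forming" interpretation above concrete, here is a minimal sketch of the rewriting step. The function name form_ucns is hypothetical, and the sketch assumes that ASCII stands in for the basic source character set (which is in fact a smaller set); it is an illustration of the reading, not what any particular compiler does.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrite every code point outside ASCII as a
// universal-character-name, following the "we form UCNs" reading of phase 1.
// Assumption: ASCII approximates the basic source character set.
std::string form_ucns(const std::u32string& source) {
    std::string out;
    for (char32_t cp : source) {
        if (cp < 0x80) {
            // Kept as-is under the assumption above.
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            // \uXXXX for code points up to U+FFFF, \UXXXXXXXX beyond that.
            if (cp <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            else
                std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Under this sketch, the two-character identifier _Ç comes out as the seven characters _\u00c7, so the character after the underscore is a backslash rather than any letter.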

I guess [lex] needs some cleanup in general.

Details like these are why we really should address https://github.com/sg16-unicode/sg16/issues/48 instead of doing point solutions for every single issue.

* Unicode (to which we have a normative reference) defines uppercase, and we follow that, but we happen to only ever apply it to the basic source character set because of the above rewriting.

I have a slight preference for restricting to just A-Z so that it doesn't require humans or tools to consult the Unicode data tables to decide if an identifier is safe to use.

Regardless of how we express the rule, I agree with this direction.

Proposed resolution:

Replace [lex.name]/3.2 with:

Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase <del>letter</del><ins>nondigit</ins> is reserved to the implementation for any use.

... and I think this is a fine wording improvement, whether or not we think it's formally necessary.

Alternatively, we could either create a new grammar production for uppercase nondigits, or just say something like "one of the universal characters in the range 0041-005A (A-Z)".
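
As a rough illustration of what the A-Z restriction buys tools, the check below needs no Unicode data tables. The name is_reserved_identifier is hypothetical, and the sketch assumes the identifier is given in its rewritten form (non-basic characters already spelled as universal-character-names) and that the execution character set is ASCII-compatible, so 'A' through 'Z' are contiguous. It covers only the rule quoted above, not the other reserved-name rules.

```cpp
#include <string_view>

// Hypothetical helper, not part of the standard or any library.
// Returns true if the identifier contains a double underscore or begins
// with an underscore followed by one of A-Z.
// Assumption: 'A'..'Z' are contiguous (true for ASCII-compatible encodings).
constexpr bool is_reserved_identifier(std::string_view id) {
    if (id.find("__") != std::string_view::npos)
        return true;
    return id.size() >= 2 && id[0] == '_' && id[1] >= 'A' && id[1] <= 'Z';
}
```

With this reading, _Foo and a__b are reserved, while _foo and the rewritten spelling _\u00c7 are not caught by this particular rule.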

_______________________________________________
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2019/10/7541.php
