sg16: Re: [SG16-Unicode] [isocpp-core] Fwd: New Core Issue: [lex.name]/3.2 under-specifies "uppercase letter"

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Mon, 28 Oct 2019 16:25:54 -0400

On Mon, Oct 28, 2019 at 12:58 PM Richard Smith <richardsmith_at_[hidden]>
wrote:

> On Mon, Oct 28, 2019 at 9:39 AM Mathias Stearn via Core <
> core_at_[hidden]> wrote:
>
>> Is it just uppercase letters in the basic source character set, or
>> anything considered an uppercase letter in the universal character set
>> after phase 1 transcoding and universal-character-name resolution? Or is
>> there some other definition of uppercase?
>>
>
> My interpretation:
>
> * We don't resolve universal-character-names; rather, we *form* them. (E.g.,
> int façade; is converted into int fa\u00e7ade;) So for example _Ç becomes
> _\u00c7, which doesn't start with an underscore followed by an uppercase
> letter (it's an underscore followed by a backslash).
>

I considered that but it felt like an overly legalistic reading at the
time. It also seems to be counter to http://eel.is/c++draft/lex.name#1. On
the other hand, that first sentence "An identifier is an arbitrarily long
sequence of letters and digits." is clearly incorrect because many of the
allowed code points (including all emoji) are neither letters nor digits.

It also seems vaguely counter to my reading of the "spirit" of
http://eel.is/c++draft/lex.phases#1.1.sentence-4, but I have no idea what
the normative impact of that sentence is. (I hope compilers' internal
encoding choices are not observable...)

I guess [lex] needs some cleanup in general.
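
To make the quoted reading concrete, here is a rough sketch of what "forming"
universal-character-names in phase 1 would look like. The function name
form_ucns is made up for illustration, and the sketch simplifies by treating
every code point below U+0080 as a member of the basic source character set
(and by assuming an ASCII-compatible execution character set), which is not
exactly what the draft says:

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper, not taken from any compiler: rewrites every code
// point at or above U+0080 as a universal-character-name, roughly as the
// quoted reading of translation phase 1 describes.
// Simplification: all code points below U+0080 are copied through as-is.
std::string form_ucns(std::u32string_view src) {
    std::string out;
    for (char32_t c : src) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            char buf[11];  // "\UXXXXXXXX" plus the terminating NUL
            if (c <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            } else {
                std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(c));
            }
            out += buf;
        }
    }
    return out;
}
```

Under that sketch, the spelling _Ç comes out as the seven characters _\u00c7,
so the character after the underscore is a backslash rather than a letter.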

> * Unicode (to which we have a normative reference) defines uppercase, and
> we follow that, but we happen to only ever apply it to the basic source
> character set because of the above rewriting.
>
>
>> I have a slight preference for restricting to just A-Z so that it doesn't
>> require humans or tools to consult the Unicode data tables to decide if an
>> identifier is safe to use.
>>
>
> Regardless of how we express the rule, I agree with this direction.
>
>> Proposed resolution:
>>
>> Replace [lex.name]/3.2 with:
>>
>> Each identifier that contains a double underscore __ or begins with an
>> underscore followed by an uppercase <del>letter</del><ins>*nondigit*</ins>
>> is reserved to the implementation for any use.
>>
>
> ... and I think this is a fine wording improvement, whether or not we
> think it's formally necessary.
>
>
>> Alternatively we could either create a new grammar production for
>> uppercase *nondigit*s, or just say something like "one of the universal
>> characters in the range 0041-005A (A-Z)"
>>
>>
>> Link to this post: http://lists.isocpp.org/core/2019/10/7541.php
>>
>
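
For what it's worth, the A-Z-only reading of the reservation rule is easy to
check mechanically, which is part of its appeal. A minimal sketch, assuming
the identifier is given as spelled after phase 1 (so anything outside the
basic source character set already appears as a universal-character-name);
both function names are hypothetical:

```cpp
#include <string_view>

// Hypothetical helper: true only for the 26 characters U+0041..U+005A.
// A lookup in a literal avoids assuming that A-Z are contiguous in the
// execution character set.
constexpr bool is_basic_uppercase(char c) {
    return std::string_view("ABCDEFGHIJKLMNOPQRSTUVWXYZ").find(c)
           != std::string_view::npos;
}

// Hypothetical check for [lex.name]/3.2 under the "A-Z only" reading:
// reserved if the identifier contains "__" anywhere, or begins with an
// underscore followed by one of A-Z.
constexpr bool is_reserved_for_any_use(std::string_view id) {
    if (id.find("__") != std::string_view::npos) {
        return true;
    }
    return id.size() >= 2 && id[0] == '_' && is_basic_uppercase(id[1]);
}

static_assert(is_reserved_for_any_use("_Abc"));
static_assert(is_reserved_for_any_use("a__b"));
static_assert(!is_reserved_for_any_use("_abc"));
static_assert(!is_reserved_for_any_use("_\\u00c7"));  // the _Ç case
```

No Unicode data tables are needed, and the _Ç case falls out as not reserved
under either interpretation discussed above.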

Received on 2019-10-28 21:26:07