sg16: Re: [SG16-Unicode] [isocpp-core] Fwd: New Core Issue: [lex.name]/3.2 under-specifies "uppercase letter"

From: JF Bastien <cxx_at_[hidden]>
Date: Mon, 28 Oct 2019 13:37:19 -0700

On Mon, Oct 28, 2019 at 1:26 PM Mathias Stearn <
redbeard0531+isocpp_at_[hidden]> wrote:

>
>
> On Mon, Oct 28, 2019 at 12:58 PM Richard Smith <richardsmith_at_[hidden]>
> wrote:
>
>> On Mon, Oct 28, 2019 at 9:39 AM Mathias Stearn via Core <
>> core_at_[hidden]> wrote:
>>
>>> Is it just uppercase letters in the basic source character set, or
>>> anything considered an uppercase letter in the universal character set
>>> after phase 1 transcoding and universal-character-name resolution? Or is
>>> there some other definition of uppercase?
>>>
>>
>> My interpretation:
>>
>> * We don't resolve universal-character-names; rather, we *form* them.
>> (E.g., int façade; is converted into int fa\u00e7ade;) So for example _Ç
>> becomes _\u00c7, which doesn't start with an underscore followed by an
>> uppercase letter (it's an underscore followed by a backslash).
>>
>
> I considered that but it felt like an overly legalistic reading at the
> time. It also seems to be counter to http://eel.is/c++draft/lex.name#1.
> On the other hand, that first sentence "An identifier is an arbitrarily
> long sequence of letters and digits." is clearly incorrect because many of
> the allowed code points (including all emoji) are neither letters nor
> digits.
>
> It also seems vaguely counter to my reading of the "spirit" of
> http://eel.is/c++draft/lex.phases#1.1.sentence-4, but I have no idea what
> the normative impact of that sentence is. (I hope compilers' internal
> encoding choices are not observable...)
>
> I guess [lex] needs some cleanup in general.
>

Details like these are why we really should address
https://github.com/sg16-unicode/sg16/issues/48
instead of doing point solutions for every single issue.

>> * Unicode (to which we have a normative reference) defines uppercase, and
>> we follow that, but we happen to only ever apply it to the basic source
>> character set because of the above rewriting.
>>
>>
>>> I have a slight preference for restricting to just A-Z so that it
>>> doesn't require humans or tools to consult the unicode data tables to
>>> decide if an identifier is safe to use.
>>>
>>
>> Regardless of how we express the rule, I agree with this direction.
>>
>>> Proposed resolution:
>>>
>>> Replace [lex.name]/3.2 with:
>>>
>>> Each identifier that contains a double underscore __ or begins with an
>>> underscore followed by an uppercase <del>letter</del><ins>*nondigit*</ins>
>>> is reserved to the implementation for any use.
>>>
>>
>> ... and I think this is a fine wording improvement, whether or not we
>> think it's formally necessary.
>>
>>
>>> Alternatively we could either create a new grammar production for
>>> uppercase *nondigit*s, or just say something like "one of the universal
>>> characters in the range 0041-005A (A-Z)"
>>>
>>>
>>> Link to this post: http://lists.isocpp.org/core/2019/10/7541.php
>>>
>
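
For illustration only, here is a minimal sketch of the A-Z-only reading
discussed above, applied to an identifier's spelling after
universal-character-names have been formed. The helper name
is_reserved_identifier is hypothetical (it is not from the standard or any
tool), and the range comparison assumes an execution character set in which
A-Z are contiguous, as in ASCII.

```cpp
#include <string_view>

// Hypothetical helper: returns true if `id` would be reserved under the
// proposed reading, where "uppercase" means only the characters A-Z.
// Assumption: 'A'..'Z' are contiguous in the execution character set.
constexpr bool is_reserved_identifier(std::string_view id) {
    // Rule 1: the identifier contains a double underscore anywhere.
    if (id.find("__") != std::string_view::npos) {
        return true;
    }
    // Rule 2: the identifier begins with an underscore followed by A-Z.
    return id.size() >= 2 && id[0] == '_' && id[1] >= 'A' && id[1] <= 'Z';
}
```

Under this sketch, _Foo and a__b are reported as reserved, while _foo and
the spelling _\u00c7 (underscore followed by a backslash) are not, which
matches the interpretation quoted above.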

Received on 2019-10-28 21:37:33