Date: Tue, 29 Oct 2019 07:20:48 -0400
I'm bringing a late paper to Belfast that will propose adopting UAX31 in
its simplest form: identifiers are XID_Start or _, followed by XID_Continue.
Portable source is required to be NFC, and using unassigned code points is
ill-formed.
That would mean no control characters embedded in identifiers, and also
no emoji. That's in addition to a paper proposing that the wording around
character sets and encodings be modernized.
There are some implications for reflection, too, as we will have to
translate from the internal representation to something portable without
losing fidelity, since narrow string literals may not support the full
range.
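
As a rough sketch of the identifier shape described above (not wording from the paper): the first code point is XID_Start or _, and every later one is XID_Continue. The C++ standard library has no Unicode property tables, so the sketch takes the two property checks as caller-supplied predicates. The names has_identifier_shape, ascii_xid_start and ascii_xid_continue are made up for illustration, and the ASCII-only stand-ins cover only the ASCII subset of the real properties. The NFC and unassigned-code-point rules are not checked here.

```cpp
#include <functional>
#include <string_view>

// Hypothetical predicate type: returns true if the code point has the property.
using CodePointPredicate = std::function<bool(char32_t)>;

// Checks the UAX31-style shape: (XID_Start | '_') XID_Continue*
// NFC normalization and the unassigned-code-point rule are NOT checked.
bool has_identifier_shape(std::u32string_view text,
                          const CodePointPredicate& is_xid_start,
                          const CodePointPredicate& is_xid_continue) {
    if (text.empty()) {
        return false;
    }
    // The first code point must be XID_Start or an underscore.
    if (text[0] != U'_' && !is_xid_start(text[0])) {
        return false;
    }
    // Every remaining code point must be XID_Continue.
    for (char32_t c : text.substr(1)) {
        if (!is_xid_continue(c)) {
            return false;
        }
    }
    return true;
}

// ASCII-only stand-ins for the real Unicode properties (assumption: a real
// implementation would consult the Unicode Character Database instead).
bool ascii_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

bool ascii_xid_continue(char32_t c) {
    return ascii_xid_start(c) || (c >= U'0' && c <= U'9') || c == U'_';
}
```

With the ASCII stand-ins, has_identifier_shape(U"_foo1", ascii_xid_start, ascii_xid_continue) accepts the name, while a name starting with a digit is rejected. Swapping in predicates backed by real Unicode data would extend the same check to the full range.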
On Mon, Oct 28, 2019, 16:37 JF Bastien <cxx_at_[hidden]> wrote:
>
>
> On Mon, Oct 28, 2019 at 1:26 PM Mathias Stearn <
> redbeard0531+isocpp_at_[hidden]> wrote:
>
>>
>>
>> On Mon, Oct 28, 2019 at 12:58 PM Richard Smith <richardsmith_at_[hidden]>
>> wrote:
>>
>>> On Mon, Oct 28, 2019 at 9:39 AM Mathias Stearn via Core <
>>> core_at_[hidden]> wrote:
>>>
>>>> Is it just uppercase letters in the basic source character set, or
>>>> anything considered an uppercase letter in the universal character set
>>>> after phase 1 transcoding and universal-character-name resolution? Or is
>>>> there some other definition of uppercase?
>>>>
>>>
>>> My interpretation:
>>>
>>> * We don't resolve universal-character-names; rather, we *form* them.
>>> (Eg, int façade; is converted into int fa\u00e7ade;) So for example _Ç
>>> becomes _\u00c7, which doesn't start with an underscore followed by an
>>> uppercase letter (it's an underscore followed by a backslash).
>>>
>>
>> I considered that but it felt like an overly legalistic reading at the
>> time. It also seems to be counter to http://eel.is/c++draft/lex.name#1.
>> On the other hand, that first sentence "An identifier is an arbitrarily
>> long sequence of letters and digits." is clearly incorrect because many of
>> the allowed code points (including all emoji) are neither letters nor
>> digits.
>>
>> It also seems vaguely counter to my reading of the "spirit" of
>> http://eel.is/c++draft/lex.phases#1.1.sentence-4, but I have no idea
>> what the normative impact of that sentence is. (I hope compilers' internal
>> encoding choices are not observable...)
>>
>> I guess [lex] needs some cleanup in general.
>>
>
> Details like these are why we really should address
> https://github.com/sg16-unicode/sg16/issues/48
> instead of doing point solutions for every single issue.
>
>
>
>>> * Unicode (to which we have a normative reference) defines uppercase, and
>>> we follow that, but we happen to only ever apply it to the basic source
>>> character set because of the above rewriting.
>>>
>>>
>>>> I have a slight preference for restricting to just A-Z so that it
>>>> doesn't require humans or tools to consult the Unicode data tables to
>>>> decide if an identifier is safe to use.
>>>>
>>>
>>> Regardless of how we express the rule, I agree with this direction.
>>>
>>>> Proposed resolution:
>>>>
>>>> Replace [lex.name]/3.2 with:
>>>>
>>>> Each identifier that contains a double underscore __ or begins with an
>>>> underscore followed by an uppercase <del>letter</del><ins>*nondigit*</ins>
>>>> is reserved to the implementation for any use.
>>>>
>>>
>>> ... and I think this is a fine wording improvement, whether or not we
>>> think it's formally necessary.
>>>
>>>
>>>> Alternatively we could either create a new grammar production for
>>>> uppercase *nondigit*s, or just say something like "one of the
>>>> universal characters in the range 0041-005A (A-Z)"
>>>>
>>>>
>>>> _______________________________________________
>>>> Core mailing list
>>>> Core_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>> Link to this post: http://lists.isocpp.org/core/2019/10/7541.php
>>>>
>>> _______________________________________________
>> SG16 Unicode mailing list
>> Unicode_at_[hidden]
>> http://www.open-std.org/mailman/listinfo/unicode
>>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>
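
For illustration, the A-Z direction discussed in the quoted thread can be sketched as a check on the identifier's spelling. This is only a sketch under stated assumptions: is_reserved_identifier is a made-up name, and the input is assumed to be the spelling after non-basic characters have been rewritten as universal-character-names, per the interpretation quoted above.

```cpp
#include <string_view>

// Hypothetical helper: true if the spelling is reserved to the implementation
// under the "A-Z only" reading, i.e. it contains a double underscore or begins
// with an underscore followed by one of the 26 basic uppercase letters.
bool is_reserved_identifier(std::string_view name) {
    // Any double underscore, anywhere in the name, reserves it.
    if (name.find("__") != std::string_view::npos) {
        return true;
    }
    // Explicit letter list, so the check does not assume that 'A'..'Z'
    // are contiguous in the execution character set.
    constexpr std::string_view uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return name.size() >= 2 && name[0] == '_' &&
           uppercase.find(name[1]) != std::string_view::npos;
}
```

Under this reading, _Foo and a__b are reserved, while _foo is not. The spelling _\u00c7 is also not reserved, because the character after the underscore is a backslash rather than one of A-Z.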
Received on 2019-10-29 12:21:03