Date: Sun, 10 Jul 2022 18:33:31 -0400
On 7/10/22 11:30 AM, Jens Maurer wrote:
> On 10/07/2022 13.25, Corentin Jabot wrote:
>>
>> On Sun, Jul 10, 2022 at 8:37 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>> On 10/07/2022 00.03, Corentin Jabot wrote:
>> >
>> >
>> > On Sat, Jul 9, 2022 at 9:03 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>> > Are there places other than identifiers where we can have UCNs
>> > outside of char/string literals? If not, maybe we should massage
>> > the grammar definition of _identifier_ instead of persisting
>> > the handwaving in lex.phases p4.
>> >
>> >
>> > The idea of doing it there, as we form preprocessor tokens, is that we don't want
>> > int i\N{SEMICOLON} to do something (I don't think implementers would like that).
>>
>> I don't understand. What does "do something" mean?
>> If we specify _identifier_ to accept any well-formed UCN
>> and we then say that an _identifier_ containing a UCN
>> for a basic character is ill-formed, that would seem
>> to work.
>>
>>
>> I think I got what you are saying.
>> We could put universal-character-name in the grammar of identifiers:
>>
>> identifier:
>>     identifier-start
>>     identifier identifier-continue
>> identifier-start:
>>     nondigit
>>     an element of the translation character set of class XID_Start
>>     _universal-character-name_
>> identifier-continue:
>>     digit
>>     nondigit
>>     an element of the translation character set of class XID_Continue
>>     _universal-character-name_
>>
>>
>> Because identifiers are maximally munched, this would work, and we could remove the wording from phase 4.
>> (We would need some additional wording in [lex.name] of course).
>> Was that your idea?
> Yes, something like that. We'd need to say that the UCNs
> are replaced by translation characters and then must still
> satisfy the _identifier_ production (i.e. XID_Start, XID_Continue).
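To make the two-step rule concrete (accept any well-formed UCN in the
grammar, then replace it and re-check against XID_Start/XID_Continue),
here is a rough sketch. Everything in it is an assumption for
illustration only: the function names are hypothetical, the property
predicates are tiny stand-ins covering ASCII, Latin-1 letters and the
combining diacritics block rather than the real Unicode tables, "basic
character" is approximated as "below U+0080", and \N{...} spellings are
not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Stand-in for "nondigit or XID_Start" (NOT the real property table).
constexpr bool starts_identifier(char32_t c) {
    return c == U'_' || (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7);
}

// Stand-in for "digit, nondigit or XID_Continue" (NOT the real table).
constexpr bool continues_identifier(char32_t c) {
    return starts_identifier(c) || (c >= U'0' && c <= U'9') ||
           (c >= 0x300 && c <= 0x36F);
}

// Decodes \uXXXX or \UXXXXXXXX at the front of s and advances past it.
// Returns nullopt if the spelling is not a well-formed UCN.
inline std::optional<char32_t> take_ucn(std::string_view& s) {
    if (s.size() < 2 || s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
        return std::nullopt;
    const std::size_t digits = (s[1] == 'u') ? 4 : 8;
    if (s.size() < 2 + digits) return std::nullopt;
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < digits; ++i) {
        const char h = s[2 + i];
        std::uint32_t d = 0;
        if (h >= '0' && h <= '9') d = static_cast<std::uint32_t>(h - '0');
        else if (h >= 'a' && h <= 'f') d = static_cast<std::uint32_t>(h - 'a' + 10);
        else if (h >= 'A' && h <= 'F') d = static_cast<std::uint32_t>(h - 'A' + 10);
        else return std::nullopt;
        value = value * 16 + d;
    }
    s.remove_prefix(2 + digits);
    return static_cast<char32_t>(value);
}

// Step 1: accept any well-formed UCN. Step 2: replace it by the character
// it names, reject basic characters, and re-check start/continue.
inline bool is_wellformed_identifier(std::string_view spelling) {
    bool first = true;
    while (!spelling.empty()) {
        char32_t c = 0;
        if (spelling[0] == '\\') {
            const auto ucn = take_ucn(spelling);
            if (!ucn) return false;
            if (*ucn < 0x80) return false;  // UCN names a basic character
            c = *ucn;
        } else {
            const auto byte = static_cast<unsigned char>(spelling[0]);
            if (byte >= 0x80) return false;  // sketch: ASCII only outside UCNs
            c = byte;
            spelling.remove_prefix(1);
        }
        if (!(first ? starts_identifier(c) : continues_identifier(c)))
            return false;
        first = false;
    }
    return !first;  // an empty spelling is not an identifier
}

// Usage examples for the sketch.
inline void ucn_identifier_examples() {
    assert(is_wellformed_identifier("caf\\u00E9"));  // U+00E9 continues
    assert(!is_wellformed_identifier("i\\u003B"));   // UCN for ';' rejected
    assert(!is_wellformed_identifier("\\u0301x"));   // combining mark first
}
```

With this shape, i\u003B is lexed as a single identifier candidate by
maximal munch and then rejected as a whole, rather than being split into
i followed by a semicolon.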
>
> Can we express C++-meaningful whitespace using a UCN?
Not at the moment.
> Wasn't there an idea somewhere that we maybe want a
> double-width space as regular C++ whitespace?
Yes. We have several related SG16 issues. Issue #69 specifically
discusses ideographic space.
* Issue #69: Specify what constitutes white-space characters
<https://github.com/sg16-unicode/sg16/issues/69>
* Issue #70: Specify what constitutes a new-line
<https://github.com/sg16-unicode/sg16/issues/70>
* Issue #74: Extend whitespace to include NEL, LS, PS, LRM, RLM, and
maybe ALM <https://github.com/sg16-unicode/sg16/issues/74>
We have previously discussed defining whitespace based on Unicode
properties, in which case there are two to choose from:
* The Pattern_White_Space
<https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:Pattern_White_Space=Yes:]>
property specifies a limited set of whitespace characters; doing
what issue #74 suggests would align C++ with this set.
* The White_Space
<https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:White_Space=Yes:]>
property specifies more characters (including ideographic space) but
does not include the LTR and RTL marks that Pattern_White_Space does.
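For comparison, the two candidate sets can be written out as predicates.
This is only a sketch: the function names are hypothetical, and the code
point lists are my transcription of the two properties, so they should
be checked against the current Unicode Character Database (PropList.txt)
before anyone relies on them.

```cpp
#include <cassert>

// Pattern_White_Space: a small, closed set that includes LRM and RLM.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) ||  // TAB, LF, VT, FF, CR
           c == 0x0020 ||                   // SPACE
           c == 0x0085 ||                   // NEL
           c == 0x200E || c == 0x200F ||    // LRM, RLM
           c == 0x2028 || c == 0x2029;      // LS, PS
}

// White_Space: a larger set that includes U+3000 IDEOGRAPHIC SPACE
// but not the directional marks.
constexpr bool is_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x00A0 ||                   // NO-BREAK SPACE
           c == 0x1680 ||                   // OGHAM SPACE MARK
           (c >= 0x2000 && c <= 0x200A) ||  // EN QUAD .. HAIR SPACE
           c == 0x2028 || c == 0x2029 ||    // LS, PS
           c == 0x202F || c == 0x205F ||    // NNBSP, MMSP
           c == 0x3000;                     // IDEOGRAPHIC SPACE
}

// Neither set contains the other.
static_assert(is_white_space(0x3000) && !is_pattern_white_space(0x3000));
static_assert(is_pattern_white_space(0x200E) && !is_white_space(0x200E));

// Usage examples for the sketch.
inline void white_space_examples() {
    assert(is_pattern_white_space(0x0085) && is_white_space(0x0085));  // NEL
    assert(!is_pattern_white_space(0x00A0) && is_white_space(0x00A0)); // NBSP
}
```

Neither property is a subset of the other, so picking one is a real
choice between ideographic space and the directional marks.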
Tom.
>
>> In which case, I like the direction
> Jens
Received on 2022-07-10 22:33:35