sg16: Re: [SG16] [isocpp-ext] P1949R4 - C++ Identifier Syntax using Unicode Standard Annex 31

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 18 Jun 2020 16:05:44 +0200

On Thu, Jun 18, 2020, 15:52 Matthew Woehlke via SG16 <sg16_at_[hidden]>
wrote:

> On 05/06/2020 16.35, Steve Downey via Ext wrote:
> > Last week SG16 (Text) approved forwarding this paper to EWG for
> > consideration. It addresses fixing the state of allowed identifiers in
> > C++.
> >
> > https://isocpp.org/files/papers/P1949R4.html (also attached as
> > d1949.html)
> >
> > Summary <https://isocpp.org/files/papers/D1949R4.html#summary>
> >
> > The allowed Unicode code points in identifiers include many that are
> > unassigned or unnecessary, and others that are actually
> > counter-productive. By adopting the recommendations of UAX #31, Unicode
> > Identifier and Pattern Syntax, C++ will be easier to work with in
> > international environments and less prone to accidental problems.
> >
> > This proposal does not address some potential security concerns—so called
> > homoglyph attacks—where letters that appear the same may be treated as
> > distinct. Methods of defense against such attacks are complex and
> > evolving, and requiring mitigation strategies would impose substantial
> > implementation burden.
> >
> > This proposal also recommends adoption of Unicode normalization form C
> > (NFC) for identifiers to ensure that when compared, identifiers intended
> > to be the same will compare as equal. Legacy encodings are generally
> > naturally in NFC when converted to Unicode. Most tools will, by default,
> > produce NFC text.
> >
> > Some unusual scripts require the use of characters as joiners that are
> > not allowed by UAX #31, these will no longer be available as identifiers
> > in C++.
> >
> > As a side-effect of adopting the identifier characters from UAX #31,
> > using emoji in or as identifiers becomes ill-formed.
> >
> > See also
> > https://unicode.org/reports/tr31/ Unicode® Standard Annex #31 UNICODE
> > IDENTIFIER AND PATTERN SYNTAX
>
> Okay... I have a potential strong objection to this. It is not clear to
> me (not being a Unicode expert) how this will interact with the many,
> many existing tools (not to mention programmer muscle memory) that
> define identifiers as:
>
> [_[:alpha:]][_[:alnum:]]*
>

[_\p{XID_Start}]\p{XID_Continue}*

>
> In particular, I note that allowed characters include "nonspacing marks,
> spacing combining marks", and "connector punctuation", which don't sound
> like they would be matched by [[:alnum:]].
>
> I would very, ***VERY*** strongly like to see an analysis of whether
> this change is going to break existing tools that rely on the above
> definition of identifiers.
>

These tools probably don't work as it is with the set of characters
currently allowed. For example, emoji are currently allowed and are not
matched by your regex.
Note also that "a\u00e9" can appear verbatim in code, which the regex
wouldn't match either.
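
To make the comparison concrete, here is a rough sketch in Python (chosen only
because its standard library exposes an XID-based check; nothing here comes
from the paper). It rests on two assumptions: the POSIX pattern is evaluated
in the C locale, so it is modelled with explicit ASCII ranges, and
str.isidentifier() is used as an approximation of
[_\p{XID_Start}]\p{XID_Continue}*, since Python's identifier rule is built on
those same properties. The helper names are hypothetical.

```python
import re

# ASCII-only stand-in for [_[:alpha:]][_[:alnum:]]* as it behaves in the
# C locale (assumption: the tool does not run under a Unicode-aware locale).
ASCII_IDENT = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def matches_ascii(name: str) -> bool:
    """True if the whole string matches the ASCII-only identifier pattern."""
    return ASCII_IDENT.fullmatch(name) is not None


def matches_xid(name: str) -> bool:
    """Approximate the XID-based pattern with Python's own identifier rule.

    Assumption: str.isidentifier() is close enough to
    [_\\p{XID_Start}]\\p{XID_Continue}* for illustration purposes.
    """
    return name.isidentifier()


samples = [
    "snake_case1",   # plain ASCII identifier
    "a\u00e9",       # 'a' followed by precomposed e-acute
    "e\u0301",       # 'e' followed by a nonspacing combining mark
    "𝕰𝖛𝖔𝖑𝖚𝖙𝖎𝖔𝖓",      # mathematical bold Fraktur letters
    "\U0001F600",    # an emoji
    "can't",         # ASCII apostrophe inside the name
]

for s in samples:
    print(ascii(s), "ascii:", matches_ascii(s), "xid:", matches_xid(s))
```

Under these assumptions, only the plain ASCII name passes the first check,
while the accented, combining-mark and Fraktur names pass the XID-based one;
the emoji and the apostrophe are rejected by both.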

> Note that I *do* expect "𝕰𝖛𝖔𝖑𝖚𝖙𝖎𝖔𝖓" to match the above
> specification and will happily consider it a bug in the tool if it does
> not.
>
> I will also happily argue that we should continue to disallow
> punctuation in identifiers, even if notionally required by some scripts.
> After all, we don't currently allow:
>
> int can'tNameThis;
>

How people interpret punctuation is very culture-dependent; the things this
paper allows are not understood as breaking by the cultures that use them.

>
> (That said, I would hope and expect compilers that are less strict will
> continue to offer that as an option. It seems this would be required
> even with the paper in its current state.)
>

The goal of the paper is to bring a bit more portability across compilers.

>
> --
> Matthew
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-06-18 09:09:25