[SG16-Unicode] NL 029 : Disallow zero-width and control characters

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 24 Oct 2019 18:24:46 -0400

SG16 has an NB comment to deal with! Tom has already scheduled it for
Belfast. It's basically that the list of allowed code points includes some
interesting control characters, like zero-width joiners and RTL modifiers.

https://github.com/cplusplus/nbballot/issues/28
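
To make the concern concrete, here is a rough sketch of a check that flags such code points in an identifier. It is not taken from the NB comment or from any proposed wording: the ranges are just the well-known zero-width and bidirectional formatting characters (U+200B..U+200F, U+202A..U+202E, U+2066..U+2069), and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true for a hand-picked set of invisible or
// direction-changing code points. This is an illustrative selection,
// not the list from the NB comment or from TR31.
constexpr bool is_invisible_or_bidi_control(char32_t c) {
    return (c >= 0x200B && c <= 0x200F)     // ZWSP, ZWNJ, ZWJ, LRM, RLM
        || (c >= 0x202A && c <= 0x202E)     // LRE, RLE, PDF, LRO, RLO
        || (c >= 0x2066 && c <= 0x2069);    // LRI, RLI, FSI, PDI
}

// Hypothetical helper: index of the first flagged code point in an
// already-decoded identifier, or npos if there is none.
inline std::size_t first_suspicious_index(std::u32string_view identifier) {
    for (std::size_t i = 0; i < identifier.size(); ++i) {
        if (is_invisible_or_bidi_control(identifier[i])) {
            return i;
        }
    }
    return std::u32string_view::npos;
}
```

A blanket rejection like this is exactly what would break the legitimate uses that TR31 tries to preserve, so treat it as a diagnostic aid rather than a proposed rule.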

There's also an issue that JF raised earlier:
https://github.com/sg16-unicode/sg16/issues/48
Improve support for Unicode characters in identifiers

Relevant Unicode standard:
https://unicode.org/reports/tr31/ Unicode Identifier and Pattern Syntax

TR31 is complicated because it allows things like identifiers written in
Farsi, which require ZWJ for disambiguation, and it suggests regexes to
detect which identifiers are allowed. It's fairly dense, and I haven't
digested it yet, but it looks like it may offer sanctioned ways to exclude
those characters.

Tailoring would also be needed, because C++ disallows some characters, such
as '$', that might otherwise be allowed. Tailoring is also discussed in TR31.


My feeling on the comment is that it's not a new issue for C++20, so it's
not clear that it has to be fixed for C++20. I believe it should be fixed,
but it ought to be fixed in a principled manner, and that likely means
TR31.

We would also have to discuss whether emoji are allowed in identifiers. TR31
does not strictly disallow them. The Tony Table should be interesting.

Received on 2019-10-25 00:25:00