On 1/25/22 1:38 AM, Reini Urban via SG16 wrote:


Tom Honermann <tom@honermann.net> wrote on Tue, Jan 25, 2022, 07:03:
On 1/24/22 11:43 AM, Reini Urban via SG16 wrote:
I've just released libu8ident, which implements the TR39 checks (and some TR31 profiles) for Unicode identifiers. It is UTF-8 only. wchar support would be trivial, but I don't think anybody (but MSVC) would need that.

It is the only such tool that is not written in Rust or Java. Before this, I had implemented these checks only in my perl5 fork, in 2016.

This accompanies my recent WG21 and WG14 papers, P2528R0 and N2916:
https://rurban.github.io/libu8ident/doc/P2528R0.html

Thank you, Reini. I will get these scheduled for review in SG16. Please note that we are now beyond the deadline for new papers for C23 and C++23, so review will be directed towards later standards. Our immediate priority is to finalize features that have been accepted for C++23. As a result, it may be a few months before these papers get scheduled in SG16. Though an argument could be made that your proposal constitutes modification of a feature accepted for C++23 (P1949) and therefore in scope for that standard, I see your proposal as more of a competing one rather than a modification. P1949 effectively brought the standard up to date with more recent Unicode versions without changing the design intent; the changes you propose are a change in direction and more disruptive.


My point of view is that C11 made identifiers insecure by making them non-identifiable, and adopting TR39 will fix that spec bug. So it is a bugfix, not a feature.
That is ok. If the committee accepts the proposal and agrees to categorize it as a bug fix, then it can be adopted as a Defect Report (DR) with the intent that implementors apply it to previous standards.

That is ok, but we'll need more time than we currently have available to understand the impact. I think implementation experience in a C++ compiler may be needed to really understand the effect on existing code bases.


In that regard I've already asked the Fedora admins to scan the Linux packages. I will ask GitHub as well, but that would then need a CVE, which might draw too much unneeded attention. The Trojan Source author did it this way.

Scanning packages for identifiers that would be restricted under your proposal seems useful, but may not be sufficient. If I understand the proposal correctly (and there is a reasonable chance that I do not; I've only quickly scanned your papers so far), the script restrictions apply to translation units, not to files. This could cause issues for project composition due to the inclusion of header files or import of modules that have identifiers written in competing scripts. Thus, actually building such software as opposed to just scanning the files may be required to identify such cases. Please correct me if I have misunderstood the proposal.
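To make the concern above concrete, here is a minimal sketch of how a per-file scan and a per-translation-unit check can disagree. It is not taken from the proposal or from libu8ident. It assumes, purely for illustration, a simplified rule of "at most one non-Latin script per checked unit", and it approximates a character's script by the first word of its Unicode character name. The file contents and the names `rough_script` and `scripts_in` are hypothetical.

```python
import re
import unicodedata

def rough_script(ch):
    """Approximate the script of `ch` by the first word of its Unicode name.

    Assumption for illustration only: a real checker would use the Unicode
    Script property (Scripts.txt), which Python's stdlib does not expose.
    """
    if ch.isascii():
        return "LATIN"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

# Identifier-like tokens: a letter or underscore followed by word characters.
IDENT = re.compile(r"[^\W\d]\w*")

def scripts_in(source):
    """Return the set of non-Latin scripts seen in identifiers of `source`."""
    found = set()
    for ident in IDENT.findall(source):
        for ch in ident:
            if ch.isalpha():
                script = rough_script(ch)
                if script != "LATIN":
                    found.add(script)
    return found

# Hypothetical project: two headers, each using a single non-Latin script.
header_a = "int αβγ = 1;\n"   # Greek identifier
header_b = "int жук = 2;\n"   # Cyrillic identifier
main_cpp = "int main() { return 0; }\n"

# Scanning file by file: every file passes the simplified rule.
per_file = {name: scripts_in(text) for name, text in
            [("a.h", header_a), ("b.h", header_b), ("main.cpp", main_cpp)]}
files_ok = all(len(s) <= 1 for s in per_file.values())

# Checking the translation unit (main.cpp after including both headers):
# the combined text mixes two non-Latin scripts.
tu_scripts = scripts_in(header_a + header_b + main_cpp)
tu_ok = len(tu_scripts) <= 1

print(files_ok, tu_ok, sorted(tu_scripts))
```

Under these assumptions each file passes on its own, while the combined translation unit does not. That is the case a package scan alone would miss.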


Note that the paper linked as P2528R0 above identifies itself as P2538R0 in the document header.

I'm not able to find a P2528R0 or a P2538R0 in the WG21 paper archive. Has the paper been submitted to WG21?


Yes, but Hal has not posted it yet. I froze it though.
The WG14 variant is also accepted and frozen, but not posted yet.
Thank you. I did manage to find P2528R0 in the WG21 repository and N2916 in the WG14 one after all.

A bit of process information in case you aren't aware: draft revisions of papers should be identified as, for example, D2528R<N> and only marked as, for example, P2528R<N> once submitted. Once a paper has been distributed with a "P" designation, it should not be modified again (this is to ensure that everyone viewing a "P" paper sees the same content).


Yes, that is described well on the webpages.

Excellent :)

Tom.


I've also come up with 2 TR31 bug reports, sent to Unicode via their contact form.
They might fix their tables for version 15 then. See doc/tr31-bugs/md

The link for that document appears to be https://rurban.github.io/libu8ident/doc/tr31-bugs.md.

Tom.

--
Reini Urban