What is missing is the approach of preprocessing training data with AI, either to filter out old idioms or to convert them into modern code.

I think that is stronger than fine-tuning or custom instructions applied after training on old code.

How well this works depends on the quality (and feasibility) of the automatic preprocessing.

(Some may find this worse: an AI trained not on human-written code, but on AI-generated or at least AI-refactored code.)
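
As a rough illustration of the filtering half of that idea, here is a minimal C++17 sketch. It is only a stand-in: plain substring matching takes the place of the AI step, and the names `kLegacyMarkers`, `find_legacy_markers` and `filter_modern` are hypothetical, not taken from any existing tool. The four markers are library facilities that were removed in C++17.

```cpp
#include <array>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical marker list (an assumption for illustration only).
// All four facilities were removed from the standard library in C++17.
constexpr std::array<std::string_view, 4> kLegacyMarkers = {
    "std::auto_ptr",
    "std::random_shuffle",
    "std::bind1st",
    "std::unary_function",
};

// Returns every legacy marker that occurs somewhere in the given source text.
// Substring search is a crude placeholder for a real parser or model.
std::vector<std::string_view> find_legacy_markers(std::string_view source) {
    std::vector<std::string_view> hits;
    for (std::string_view marker : kLegacyMarkers) {
        if (source.find(marker) != std::string_view::npos) {
            hits.push_back(marker);
        }
    }
    return hits;
}

// Keeps only the training samples in which no legacy marker was found.
std::vector<std::string> filter_modern(const std::vector<std::string>& samples) {
    std::vector<std::string> kept;
    for (const std::string& sample : samples) {
        if (find_legacy_markers(sample).empty()) {
            kept.push_back(sample);
        }
    }
    return kept;
}
```

Given the two samples `std::auto_ptr<int> p(new int(1));` and `auto p = std::make_unique<int>(1);`, `filter_modern` keeps only the second. A real pipeline would need something far more robust, since substring matching also fires on comments and string literals, and the "convert into modern code" variant is not covered here at all.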
 

-----Original Message-----
From: Adrian Johnston via Std-Proposals <std-proposals@lists.isocpp.org>
Sent: Tue 02.06.2026 22:44
Subject: [std-proposals] Strategic Direction for AI in C++: Governance, and Ecosystem
To: C++ Proposals <std-proposals@lists.isocpp.org>;
CC: Adrian Johnston <ajohnston4536@gmail.com>;
Recently (2026-02-23) the ISO C++ Directions Group (DG) / WG21 published a document:
 
Strategic Direction for AI in C++: Governance, and Ecosystem
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2026/p4023r0.pdf
 
As one of its findings it identified a problem with "Garbage In, Garbage Out".
 
The DG recognizes a critical "Garbage In, Garbage Out" problem facing C++ developers using AI. Current models are trained on legacy C++ (C++98/03), vendor-specific dialects, and unsafe patterns found online.
 
I'd say this is an understatement.
 
What I am observing is that high-quality websites like https://en.cppreference.com/ are blocking AI search tools because they don't generate advertising revenue. As a result, my AI (Claude) routinely ends up searching online posts made by people who are confused, asking for help, and getting terse responses that are incomplete at best.
 
Next, if I ask Claude what data it was given about the C++ standard, it says it was trained on "commentary, documentation, and discussion during training — not verbatim text." It can identify final drafts like N4950 as being available, but for some reason it needs to be explicitly encouraged to consult that document.
 
In general, the AI companies are being very careful to avoid being seen to use copyrighted data like the C++ standard.
 
If we want AI-generated responses and AI-generated code to be as modern and correct as possible, I think it would make sense to release the copyright to the AI companies for use in training, and then insist that they use that information as purveyors of programming tools.
 
If it is well known that there is no barrier to training an AI correctly on the most recent C++ standard, and that users should expect verbatim information and standards-aware code from their AI, then I would hope for some improvement on the current situation. It is very easy to add RLHF training data if the AI company is allowed to use the standard to create it.
 
Oddly enough, Claude is capable of providing more modern code when requested. In general, I find AI has a serious issue where (for no reason) it assumes your software may be 10 years out of date, unless told otherwise.
 
Regards,
Adrian Johnston
 
 
 
-- 
 Std-Proposals mailing list
 Std-Proposals@lists.isocpp.org
 https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals