Next, if I ask Claude what data it was given about the C++ standard, it says it was trained on "commentary, documentation, and discussion during training — not verbatim text." It can identify final drafts like N4950 as being available, but for some reason it needs to be explicitly encouraged to consult that document.
Not sure why you'd trust it to accurately know its own training data.
In my experience, ChatGPT is actually somewhat accurate at language lawyering. It may hallucinate wording and subclause references that don't exist (or at least not in the current draft), but it does get you the right answer sometimes. That isn't to say you should rely on it, but this would only be possible if our working drafts were part of its training data, since it doesn't even seem to require an internet search to give you answers that are sometimes correct.
In general, the AI companies are being very careful to avoid being seen to use copyrighted data like the C++ standard.
Like the time NVidia allegedly contacted Anna's Archive to secure access to 500 terabytes of copyrighted material? Similar stories for pretty much all the "AI companies". I'm not getting the impression that even one of them is "being very careful".
If we want AI-generated responses and AI-generated code to be as modern and correct as possible, I think it would make sense to release the copyright to the AI companies for use in training. And then insist they use that information as purveyors of programming tools.
Well, it's almost certainly in their training data anyway, legally or not.
Generally, I don't think that cppreference or the C++ standard have all that much of an impact in AI training. AI is largely trained on public code and on individual problems, and the answers always seem to be along the lines of what the average C++ code on GitHub would look like, for better or worse. This means AI is very eager to use std::string and std::vector all over the place rather than something non-allocating, or a view such as std::string_view or std::span. It's also hard to get it to use string literal suffixes, char8_t, C++23 features, etc., even if the surrounding code already has that style.
Unless the AI providers put a lot of effort into advanced C++ training methods, just dumping the C++ standard or cppreference into the training set won't do much, I suspect.