Date: Wed, 3 Jun 2026 06:09:59 +0200
> Next, if I ask Claude what data it was given about the C++ standard, it
> says it was trained on "commentary, documentation, and discussion during
> training — not verbatim text." It can identify final drafts like N4950 as
> being available, but for some reason it needs to be explicitly encouraged
> to consult that document.
>
Not sure why you'd trust it to accurately know its own training data.

In my experience, ChatGPT is sometimes fairly accurate at language
lawyering. It may hallucinate wording and subclause references that don't
exist (or at least not in the current draft), but it sometimes gets you
the right answer. That isn't to say you should rely on it, but this would
only be possible if our working drafts were part of its training data,
since it doesn't even seem to need an internet search to give you
occasionally correct answers.
> In general, the AI companies are being very careful to avoid being seen to
> use copyrighted data like the C++ standard.
>
Like the time NVIDIA allegedly contacted Anna's Archive to secure access to
500 terabytes of copyrighted material? There are similar stories about
pretty much all of the "AI companies". I'm not getting the impression that
even one of them is "being very careful".
> If we want AI generated responses and AI generated code to be as modern
> and correct as possible, I think it would make sense to release the
> copyright to the AI companies to use in training. And then insist they use
> that information as purveyors of programming tools.
>
Well, it's almost certainly in their training data anyway, legally or not.

Generally, I don't think that cppreference or the C++ standard has all
that much of an impact on AI training. AI is largely trained on public code
and on individual problems, and the answers always seem to be along the
lines of what the average C++ code on GitHub looks like, for better or
worse. This means AI is very eager to use std::string and std::vector all
over the place instead of something non-allocating such as
std::string_view or std::span. It's also hard to get it to use string
literal suffixes, char8_t, C++23 features, etc., even if the surrounding
code has that style.

Unless the AI providers put a lot of effort into advanced C++ training
methods, just dumping the C++ standard or cppreference into the training
set won't do much, I suspect.
Received on 2026-06-03 04:10:14
