Date: Fri, 9 May 2025 14:07:10 +0200
It does not work this way.
If you give an LLM all the papers to read, it can only do so one at a time, and it has a limited context window.
If you train or fine-tune it to reproduce the papers, and it succeeds, it still has not reasoned about them, so it cannot answer overarching questions. You would have to train it on lots of questions (and, ideally, answers) for it to make the connections between C++ concepts and the papers. At best, if the same question had already been asked on a mailing list it was trained on, it could perhaps reproduce the answer.
That's why you usually use RAG (retrieval-augmented generation) or a hybrid approach; see the sketch below:
- The LLM interprets the user's question and creates search keywords. The matching papers are retrieved.
- One by one, the API feeds each paper into a separate session, which determines whether the paper is relevant and what can be learned from it.
- Another session combines all of those findings into a single answer.
Alternatively, you create summaries of all the papers, so that they all fit into the context window of a single session.
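For illustration, here is a minimal C++ sketch of that flow. Both helpers are hypothetical stand-ins, not real APIs: llm_complete() represents one chat-completion round-trip, and retrieve() represents a lookup against a local index of the papers.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical: one round-trip to a chat-completion API
// (in practice an HTTP POST plus JSON handling).
std::string llm_complete(std::string const &prompt)
{
    return "<answer to: " + prompt + ">"; // stub for illustration
}

// Hypothetical retrieval against a local index of the papers.
std::vector<std::string> retrieve(std::string const &keywords)
{
    (void)keywords;
    return { "P1234R0", "P2345R1" }; // stubbed candidate list
}

int main()
{
    std::string const question = "Which papers propose a new cv-qualifier?";

    // 1. One session turns the user's question into search keywords.
    std::string const keywords =
        llm_complete("Extract search keywords from: " + question);

    // 2. Retrieve the candidate papers for those keywords.
    std::vector<std::string> const candidates = retrieve(keywords);

    // 3. A separate session per paper judges relevance and extracts findings.
    std::vector<std::string> findings;
    for ( std::string const &paper : candidates )
        findings.push_back( llm_complete("Is " + paper + " relevant to '"
                                         + question + "'? Summarise.") );

    // 4. A final session combines the per-paper findings into one answer.
    std::string combined;
    for ( std::string const &finding : findings ) combined += finding + "\n";
    std::cout << llm_complete("Combine these findings:\n" + combined) << "\n";
}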
This is quickly getting off-topic, so perhaps continue off-list?
-----Original Message-----
From: Frederick Virchanza Gotham via Std-Proposals <std-proposals_at_[hidden]>
Sent: Fri 09.05.2025 14:05
Subject: Re: [std-proposals] Dedicated website with AI that has processed all papers
To: std-proposals_at_[hidden];
CC: Frederick Virchanza Gotham <cauldwell.thomas_at_[hidden]>;
On Fri, May 9, 2025 at 10:34 AM Sebastian Wittmeier wrote:
>
> They normally won't let you use their domain.
>
> They offer API access, which you can call from your server, with the answer either interpreted by your server code or presented directly to the user. It would run on your domain.
>
> It is quite inexpensive per request. Depending on the model, less than a cent or a few cents.
>
> You can also create some kind of plugin for the public ChatGPT instead, but that is much less powerful and I would not suggest it for this use.
At the end of this post I've written a C++ program to get all the
paper numbers and revision numbers. There have been 6545 papers
submitted if you include all the revisions.
Here's quite a simple test to see how well ChatGPT knows all the
papers. I asked it:
"Of all the C++ papers from P0001R0 up to P3672R0, how many of
these papers mention the word 'zip'?"
It came back with one paper that had 'zip' in its title, but at the end it said:
"For a thorough analysis, one would need to review each paper
individually or utilize a search tool that indexes the content of
these documents."
So I retorted with:
"Can you please read through each individual paper for me to confirm?"
but it just kept coming back with suggestions for how to automate
the search. The bottom line is that the free ChatGPT refuses to
read through the 6545 papers.
So I asked ChatGPT:
"I want to set up a website that has ChatGPT that has been trained
on the 6545 papers that have been submitted. I want people around the
world to be able to use it, say 10 people every hour. What kind of
money would this cost me?", and it came back with:
Component            Cost Range (Monthly)
-----------------------------------------
Hosting & DB         $30–$70
LLM API (GPT-4o)     $30–$50
Miscellaneous        $10–$20
Total                ~$70–$140/month
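(As a rough sanity check on the LLM API row: 10 requests an hour is about 7,200 a month, and at the roughly-half-a-cent-per-request figure mentioned above that comes to around $36/month, in line with the $30–$50 estimate.)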
Next I asked: "What if I wanted to enable it to be used by hundreds if
not thousands of people every day? What would that cost me?", and it
came back with a maximum total cost of $600 per month. So it's not
absolutely crazy money. These would be small numbers to the people
who make charitable contributions.
Of course I could use the below program to download all the papers to
my webspace and then use some really advanced text-searching software,
but an AI would be much more versatile. I mean, I want to be able to
ask it questions like, "Of all the 6545 papers submitted so far, pick
out the ones that suggest adding a new CV-qualifier to the language".
You wouldn't be able to make this query with an advanced search tool,
as you'd get way too many false positives no matter how you word it.
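For comparison, the keyword-search baseline is trivial once the papers are downloaded; a sketch like the following (assuming, hypothetically, that the papers sit as files under ./papers/) reports every file containing a term. It matches text, not intent, which is exactly why it over-matches: any paper that merely quotes the standard's cv-qualifier wording gets flagged.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string const term = "cv-qualifier";
    // Report every downloaded paper whose text contains the term.
    for ( auto const &entry : std::filesystem::directory_iterator("papers") )
    {
        std::ifstream file(entry.path());
        std::ostringstream contents;
        contents << file.rdbuf();
        if ( contents.str().find(term) != std::string::npos )
            std::cout << entry.path().filename().string() << "\n";
    }
}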
I would also train the AI on all the posts to this mailing list (and
older mailing lists) from 1990 to 2025, but I would give the contents
of the papers a higher priority than the contents of the posts.
[ start C++ program ]
#include <cstddef>     // size_t
#include <iostream>    // cout, cerr
#include <regex>       // regex, smatch
#include <set>         // set
#include <stdexcept>   // runtime_error
#include <string>      // string, to_string
#include <curl/curl.h> // CURL, curl_easy_init
#include "Auto.h"      // The 'Auto' scope-guard macro
using std::cout, std::endl, std::string, std::set, std::size_t;
struct Paper {
    unsigned num, rev;

    // Order by paper number, then by revision number
    bool operator<(Paper const other) const noexcept
    {
        return (num < other.num) || ( (num == other.num) && (rev < other.rev) );
    }

    // Renders e.g. "P1234R5" into a thread-local buffer that is
    // overwritten on every call
    char const *str(void) const noexcept
    {
        static thread_local char s[] = "PxxxxRxx";
        s[1] = '0' + num / 1000u % 10u;
        s[2] = '0' + num /  100u % 10u;
        s[3] = '0' + num /   10u % 10u;
        s[4] = '0' + num /    1u % 10u;
        if ( rev < 10u )
        {
            s[6] = '0' + rev;
            s[7] = '\0';
        }
        else
        {
            s[6] = '0' + rev / 10u % 10u;
            s[7] = '0' + rev /  1u % 10u;
            s[8] = '\0';
        }
        return s;
    }
};
std::ostream &operator<<(std::ostream &os, Paper const paper)
{
    return os << paper.str();
}

// libcurl write callback: appends the received bytes to the std::string
// supplied via CURLOPT_WRITEDATA. Returning anything other than
// size * nmemb tells libcurl to abort the transfer.
size_t WriteCallback(void *const contents, size_t const size,
                     size_t const nmemb, void *const userp) noexcept
{
    try
    {
        string *const data = static_cast<string*>(userp);
        data->append( static_cast<char*>(contents), size * nmemb );
        return size * nmemb;
    }
    catch(...)
    {
        return 0u;
    }
}
set<Paper> papers;

// Scans one year's HTML index for paper codes of the form PnnnnRn
void ExtractPaperCodes(string const &content)
{
    std::regex pattern(R"(P(\d{4})R(\d+))");
    std::smatch match;
    string::const_iterator search_start( content.cbegin() );
    while ( std::regex_search(search_start, content.cend(), match, pattern) )
    {
        unsigned const num = static_cast<unsigned>(std::stoul(match[1].str())),
                       rev = static_cast<unsigned>(std::stoul(match[2].str()));
        papers.insert( Paper{ num, rev } );
        search_start = match.suffix().first;
    }
}
// Downloads one year's paper index and records every paper code in it
void FetchPaperCodesForYear(string const &year_url)
{
    CURL *const curl = curl_easy_init();
    if ( nullptr == curl ) throw std::runtime_error("Failed to initialise CURL library");
    Auto( curl_easy_cleanup(curl) );  // release the handle on scope exit
    CURLcode res;
    string read_buffer;
    curl_easy_setopt(curl, CURLOPT_URL, year_url.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback );
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &read_buffer );
    res = curl_easy_perform(curl);
    if ( CURLE_OK != res ) throw std::runtime_error("CURL request failed for "
                                                    + year_url + ": " + curl_easy_strerror(res));
    ExtractPaperCodes(read_buffer);
}
auto main(void) -> int
{
    cout << "Fetching paper codes for year: " << std::flush;
    for ( unsigned year = 1989u; year <= 2025u; ++year )
    {
        string const year_url = "https://www.open-std.org/jtc1/sc22/wg21/docs/papers/"
                                + std::to_string(year) + "/";
        cout << (1989u==year ? "" : ", ") << year << std::flush;
        FetchPaperCodesForYear(year_url);
    }
    for ( Paper const &paper : papers ) cout << endl << paper;
    cout << "\n\nTotal unique papers found: " << papers.size() << endl;
}
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-05-09 12:14:16