Date: Fri, 9 May 2025 14:07:10 +0200
It does not work this way.
If you give an LLM all the papers to read, it can only do so one at a time, and it has a limited context window.
If you train or fine-tune it to reproduce the papers, and it succeeds, it still has not reasoned about them, so it cannot answer overarching questions. You would have to train it on lots of questions (and, ideally, answers) for it to make the connections between C++ concepts and the papers. At best, if the same question had already been asked on a mailing list it was trained on, it could perhaps reproduce the answer.
That's why you usually use RAG (retrieval-augmented generation) or a hybrid approach; see the sketch below:
- The LLM interprets the user's question and creates search keywords. The matching papers are retrieved.
- One by one, the API feeds each paper into a separate session, which determines whether the paper is relevant and what can be learned from it.
- Another session combines all of those findings into a single answer.
Alternatively, you create summaries of all the papers, so that they all fit into the context window of a single session.
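For illustration, here is a minimal C++ sketch of that flow. Both helpers are hypothetical stand-ins, not real APIs: llm_complete() represents one chat-completion round-trip, and retrieve() represents a lookup against a local index of the papers.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical: one round-trip to a chat-completion API
// (in practice an HTTP POST plus JSON handling).
std::string llm_complete(std::string const &prompt)
{
    return "<answer to: " + prompt + ">"; // stub for illustration
}

// Hypothetical retrieval against a local index of the papers.
std::vector<std::string> retrieve(std::string const &keywords)
{
    (void)keywords;
    return { "P1234R0", "P2345R1" }; // stubbed candidate list
}

int main()
{
    std::string const question = "Which papers propose a new cv-qualifier?";

    // 1. One session turns the user's question into search keywords.
    std::string const keywords =
        llm_complete("Extract search keywords from: " + question);

    // 2. Retrieve the candidate papers for those keywords.
    std::vector<std::string> const candidates = retrieve(keywords);

    // 3. A separate session per paper judges relevance and extracts findings.
    std::vector<std::string> findings;
    for ( std::string const &paper : candidates )
        findings.push_back( llm_complete("Is " + paper + " relevant to '"
                                         + question + "'? Summarise.") );

    // 4. A final session combines the per-paper findings into one answer.
    std::string combined;
    for ( std::string const &finding : findings ) combined += finding + "\n";
    std::cout << llm_complete("Combine these findings:\n" + combined) << "\n";
}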
This is quickly getting off-topic, so perhaps continue off-list?
-----Original Message-----
From: Frederick Virchanza Gotham via Std-Proposals <std-proposals_at_[hidden]>
Sent: Fri 09.05.2025 14:05
Subject: Re: [std-proposals] Dedicated website with AI that has processed all papers
To: std-proposals_at_[hidden];
CC: Frederick Virchanza Gotham <cauldwell.thomas_at_[hidden]>;
On Fri, May 9, 2025 at 10:34 AM Sebastian Wittmeier wrote:
>
> They normally won't let you use their domain.
>
> They offer API access, which you can call from your server, with the answer either interpreted by your server code or presented directly to the user. It would run on your domain.
>
> It is quite inexpensive per request. Depending on the model, less than a cent or a few cents.
>
> You can also create some kind of plugin for the public ChatGPT instead, but that is much less powerful and I would not suggest it for this use.
At the end of this post I've written a C++ program to get all the
paper numbers and revision numbers. There have been 6545 papers
submitted if you include all the revisions.
Here's quite a simple test to see how well ChatGPT knows all the
papers. I asked it:
"Of all the C++ papers from P0001R0 up to P3672R0, how many of
these papers mention the word 'zip'?"
It came back with one paper that had 'zip' in its title, but at the end it said:
"For a thorough analysis, one would need to review each paper
individually or utilize a search tool that indexes the content of
these documents."
So I retorted with:
"Can you please read through each individual paper for me to confirm?"
but it just kept coming back with suggestions for how to automate
the search. The bottom line is that the free ChatGPT refuses to
read through the 6545 papers.
So I asked ChatGPT:
"I want to set up a website that has ChatGPT that has been trained
on the 6545 papers that have been submitted. I want people around the
world to be able to use it, say 10 people every hour. What kind of
money would this cost me?", and it came back with:
Component            Cost Range (Monthly)
-----------------------------------------
Hosting & DB         $30–$70
LLM API (GPT-4o)     $30–$50
Miscellaneous        $10–$20
Total                ~$70–$140/month
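(As a rough sanity check on the LLM API row: 10 requests an hour is about 7,200 a month, and at the roughly-half-a-cent-per-request figure mentioned above that comes to around $36/month, in line with the $30–$50 estimate.)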
Next I asked: "What if I wanted to enable it to be used by hundreds if
not thousands of people every day? What would that cost me?", and it
came back with a maximum total cost of $600 per month. So it's not
absolutely crazy money. These would be small numbers to the people
who make charitable contributions.
Of course I could use the below program to download all the papers to
my webspace and then use some really advanced text-searching software,
but an AI would be much more versatile. I mean, I want to be able to
ask it questions like, "Of all the 6545 papers submitted so far, pick
out the ones that suggest adding a new CV-qualifier to the language".
You wouldn't be able to make this query with an advanced search tool,
as you'd get way too many false positives no matter how you word it.
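For comparison, the keyword-search baseline is trivial once the papers are downloaded; a sketch like the following (assuming, hypothetically, that the papers sit as files under ./papers/) reports every file containing a term. It matches text, not intent, which is exactly why it over-matches: any paper that merely quotes the standard's cv-qualifier wording gets flagged.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string const term = "cv-qualifier";
    // Report every downloaded paper whose text contains the term.
    for ( auto const &entry : std::filesystem::directory_iterator("papers") )
    {
        std::ifstream file(entry.path());
        std::ostringstream contents;
        contents << file.rdbuf();
        if ( contents.str().find(term) != std::string::npos )
            std::cout << entry.path().filename().string() << "\n";
    }
}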
I would also train the AI on all the posts to this mailing list (and
older mailing lists) from 1990 to 2025, but I would give the contents
of the papers a higher priority than the contents of the posts.
[ start C++ program ]
#include <cstddef>     // size_t
#include <iostream>    // cout, cerr
#include <regex>       // regex, smatch
#include <set>         // set
#include <stdexcept>   // runtime_error
#include <string>      // string, to_string
#include <curl/curl.h> // CURL, curl_easy_init
#include "Auto.h"      // The 'Auto' scope-guard macro
using std::cout, std::endl, std::string, std::set, std::size_t;
struct Paper {
    unsigned num, rev;

    // Order by paper number, then by revision number
    bool operator<(Paper const other) const noexcept
    {
        return (num < other.num) || ( (num == other.num) && (rev < other.rev) );
    }

    // Renders e.g. "P1234R5" into a thread-local buffer that is
    // overwritten on every call
    char const *str(void) const noexcept
    {
        static thread_local char s[] = "PxxxxRxx";
        s[1] = '0' + num / 1000u % 10u;
        s[2] = '0' + num /  100u % 10u;
        s[3] = '0' + num /   10u % 10u;
        s[4] = '0' + num /    1u % 10u;
        if ( rev < 10u )
        {
            s[6] = '0' + rev;
            s[7] = '\0';
        }
        else
        {
            s[6] = '0' + rev / 10u % 10u;
            s[7] = '0' + rev /  1u % 10u;
            s[8] = '\0';
        }
        return s;
    }
};
std::ostream &operator<<(std::ostream &os, Paper const paper)
{
    return os << paper.str();
}

// libcurl write callback: appends the received bytes to the std::string
// supplied via CURLOPT_WRITEDATA. Returning anything other than
// size * nmemb tells libcurl to abort the transfer.
size_t WriteCallback(void *const contents, size_t const size,
                     size_t const nmemb, void *const userp) noexcept
{
    try
    {
        string *const data = static_cast<string*>(userp);
        data->append( static_cast<char*>(contents), size * nmemb );
        return size * nmemb;
    }
    catch(...)
    {
        return 0u;
    }
}
set<Paper> papers;

// Scans one year's HTML index for paper codes of the form PnnnnRn
void ExtractPaperCodes(string const &content)
{
    std::regex pattern(R"(P(\d{4})R(\d+))");
    std::smatch match;
    string::const_iterator search_start( content.cbegin() );
    while ( std::regex_search(search_start, content.cend(), match, pattern) )
    {
        unsigned const num = static_cast<unsigned>(std::stoul(match[1].str())),
                       rev = static_cast<unsigned>(std::stoul(match[2].str()));
        papers.insert( Paper{ num, rev } );
        search_start = match.suffix().first;
    }
}
// Downloads one year's paper index and records every paper code in it
void FetchPaperCodesForYear(string const &year_url)
{
    CURL *const curl = curl_easy_init();
    if ( nullptr == curl ) throw std::runtime_error("Failed to initialise CURL library");
    Auto( curl_easy_cleanup(curl) );  // release the handle on scope exit
    CURLcode res;
    string read_buffer;
    curl_easy_setopt(curl, CURLOPT_URL, year_url.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback );
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &read_buffer );
    res = curl_easy_perform(curl);
    if ( CURLE_OK != res ) throw std::runtime_error("CURL request failed for "
                                                    + year_url + ": " + curl_easy_strerror(res));
    ExtractPaperCodes(read_buffer);
}
auto main(void) -> int
{
    cout << "Fetching paper codes for year: " << std::flush;
    for ( unsigned year = 1989u; year <= 2025u; ++year )
    {
        string const year_url = "https://www.open-std.org/jtc1/sc22/wg21/docs/papers/"
                                + std::to_string(year) + "/";
        cout << (1989u==year ? "" : ", ") << year << std::flush;
        FetchPaperCodesForYear(year_url);
    }
    for ( Paper const &paper : papers ) cout << endl << paper;
    cout << "\n\nTotal unique papers found: " << papers.size() << endl;
}
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-05-09 12:14:16