std-proposals

Re: [std-proposals] Dedicated website with AI that has processed all papers

From: Oliver Hunt <oliver_at_[hidden]>
Date: Wed, 28 May 2025 12:27:55 -0400
I think you need to be clearer about your goals.

If the intent is to publish a tool where people can submit a paper and get a summary of it, that’s hugely different from a tool that you can ask to “reword (explain/describe) proposal X”. In the first case, or if you don’t intend to share the tool together with a model that is trained on those papers, I _suspect_ you’re fine.

The problems arise (at least for me) when you distribute a model that is trained on copyrighted works (some folk may be ok with you using their papers for training, but you would need their permission).

Let’s drop the reference to AI and consider a much simpler pair of tools for rewording and searching: grep and sed.

One option is “I have trained a model on all these papers and it can provide a summary”: This is functionally equivalent to “I have copied all of these papers into a directory, and if you ask about a paper it produces that paper after running sed to replace a bunch of words with their synonyms and remove the authors’ names”.

The other is “A person loads a paper into my AI and gets a description”: This is functionally equivalent to the user supplying a paper they have downloaded (so they have the authors and original document available), with your program running sed on it and providing searching, etc.

The first option involves copying other people’s work and distributing it without consent, and the second doesn’t. It’s super important to understand that “AI” descriptions are not anything more than a very expensive sed+grep - they’re a purely mechanical transform of original works, and the only reason for any variance is the deliberate addition of randomness - in the above sed example, it would be equivalent to having multiple synonyms for each word and choosing the synonym randomly each time.
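
To make the analogy concrete, here is a minimal C++ sketch of the "sed with random synonyms" idea. The `SynonymTable` type and the `reword` function are hypothetical names made up for this illustration, and the synonym table passed in is assumed to be supplied by the caller; nothing here describes how any real model or tool is implemented.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical synonym table: each key maps to the words that may replace it.
using SynonymTable = std::map<std::string, std::vector<std::string>>;

// Replace every whitespace-separated word that has an entry in the table with
// one of its synonyms, chosen at random. Words without an entry pass through
// unchanged, and words are re-joined with single spaces.
std::string reword(const std::string& text, const SynonymTable& table,
                   std::mt19937& rng)
{
    std::istringstream in(text);
    std::string word;
    std::string out;
    while (in >> word) {
        auto it = table.find(word);
        if (it != table.end() && !it->second.empty()) {
            // The only source of variation between runs is this random pick.
            std::uniform_int_distribution<std::size_t> pick(0, it->second.size() - 1);
            word = it->second[pick(rng)];
        }
        if (!out.empty()) out += ' ';
        out += word;
    }
    return out;
}
```

With a fixed seed the output is reproducible; the variation between runs comes only from the random pick, which is the point the analogy is making.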

—Oliver

> On May 28, 2025, at 9:15 AM, Frederick Virchanza Gotham via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> On Wed, May 28, 2025 at 1:39 PM Jonathan Wakely wrote:
>>
>> Although I think you can probably see why people are concerned when
>> the thread topic says "AI that has processed all papers".
>>
>> That certainly sounds like an LLM that has been trained on all the
>> papers, not one that retains no knowledge of them between queries. If
>> that's not what you're talking about now, then it's a very misleading
>> title.
>
>
> I'm new to all this AI stuff. I'm not even entirely sure on the
> difference between 'training' and 'learning' and 'retrieval-augmented
> generation'.
>
> What I can tell you is that my program works roughly as follows:
>
> auto handle = CreateHandleToArtificialIntelligenceModel();
>
> for ( auto const &e : container_all_papers )
> {
>     DeleteEverythingYouKnow( handle );  // wipe the model's memory
>     LoadPaperIntoModel( handle, e );    // load only the current paper
>     bool const is_matching = AskQuestionAndGetBooleanAnswer( handle,
>         "Does this paper mention chocolate?" );
>     if ( is_matching ) AddToSearchResults( e );
> }
>
> So the AI's memory gets wiped before every individual paper is loaded
> in. Actually you can see the beginnings of that loop here:
>
> https://github.com/healytpk/paperkernelcxx/blob/b5557109ccf4e617e046be4df0cfbaf48f84574b/main_program/GUI_Dialog_Main.cpp#L338
>
> In the body of the loop, I invoke "NewContext" which wipes its memory,
> and then I invoke 'LoadPaper'. I can't test that code properly yet
> though coz I haven't got a proper graphics card. I should have a new
> laptop in a month or so though and I'll get cracking on it then.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
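
The per-paper loop quoted above can be sketched as self-contained C++ with a stand-in for the model. Everything here is an assumption for illustration: `StubModel`, `Paper`, and `search` are hypothetical names, and the stub answers the question with a plain substring search rather than any real inference. It is not the code from the linked repository.

```cpp
#include <string>
#include <vector>

// Stand-in for a language model. A real model would run inference; this stub
// only remembers the text loaded since the last reset.
class StubModel {
public:
    void new_context() { context_.clear(); }                    // forget everything
    void load_paper(const std::string& text) { context_ += text; }
    bool mentions(const std::string& term) const {
        return context_.find(term) != std::string::npos;
    }
private:
    std::string context_;
};

// Hypothetical paper record: an identifier plus the paper's text.
struct Paper {
    std::string id;
    std::string text;
};

// Query each paper in isolation: the context is wiped before every load, so
// an answer can only depend on the single paper currently loaded.
std::vector<std::string> search(StubModel& model,
                                const std::vector<Paper>& papers,
                                const std::string& term)
{
    std::vector<std::string> results;
    for (const Paper& p : papers) {
        model.new_context();
        model.load_paper(p.text);
        if (model.mentions(term)) results.push_back(p.id);
    }
    return results;
}
```

If the `new_context()` call were removed, text from earlier papers would remain in the stub's context and later papers could match on it, which is the situation the reset is meant to rule out.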

Received on 2025-05-28 16:28:15