Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

From: Kamalesh Lakkampally <founder_at_[hidden]>
Date: Thu, 4 Dec 2025 15:43:38 +0530
Hello David,

Thank you for taking the time to restate the scenario in your own terms —
that actually comes very close to what I have in mind, and it helps me see
where my earlier explanations were unclear.
1. Your restatement of the use-case

Your description with:

   - a Task type,
   - a task_list,
   - a schedule_tasks(task_list) function that reorders / updates it, and
   - an execute_task_list() loop that walks the list and calls each task

is indeed very close to the core use-case.

Conceptually, that’s exactly the pattern I’m thinking about:

void execute_task_list() {
    for (auto&& task : task_list) {
        task(); // T1, then T2, then T3, in some dynamic order
    }
}
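
For concreteness, the minimal types I have in mind match your description
below (names illustrative; Task could equally be something richer):

#include <vector>

using Task = void (*)();      // simplest case: pointer to a void-to-void function
std::vector<Task> task_list;  // schedule_tasks() reorders / updates this container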

The key observation (which you also pointed out) is that when task_list is
dynamic and updated frequently, the overhead of:

   - the loop body,
   - repeated calls,
   - and the dispatcher/scheduler structure around it

can become significant relative to the cost of the tasks themselves.
2. Clarifying the “side-effect free” comment

You are right to call out the contradiction in my earlier wording.
I did not mean that all tasks are strictly side-effect-free in the formal
C++ sense. What I meant is that the *sequencing mechanism itself* (the
“fetch-only” layer) should not have side effects — it only describes the
order in which tasks are evaluated.

The tasks themselves may certainly have side effects; in many real systems
they do. The order therefore does matter, but the role of the sequencing
layer is to:

   - describe “T1, then T2, then T3, in this dynamically determined order”,
   - without the sequencing operations themselves performing observable work.

That was an imprecise explanation on my part, and your comment helped me
see that.
3. What I am trying to explore *beyond* the current model

Your example of generating:

void execute_task_list() {
    T1();
    T2();
    T3();
    ...
}

for a fixed schedule is exactly the kind of thing that compilers can
optimize very well today for static ordering. I fully agree that *with
dynamic ordering*, we cannot expect the same kind of aggressive static
inlining / unrolling / specialization.

I am not expecting the language to magically recover “perfect” static
optimization for dynamic schedules.

What I am exploring is whether C++ can:

   - make the *sequencing intent* explicit and separate from the dispatcher
     loop, so that
   - implementations have more freedom to choose different strategies for
     executing that sequence of tasks back-to-back, without the programmer
     having to encode that as a hand-written dispatcher.

In your terms, today we express sequencing as:

for (auto&& task : task_list) {
    task();
}

but this is semantically a loop + function call pattern. In practice the
compiler has to preserve that structure: because the order is only known at
run time, it cannot replace the loop with a different internal execution
strategy that maintains the same observable semantics while removing some
of the dispatcher overhead.

The “fetch-only” idea (which I clearly need to rename and formalize much
better) is about:

   - treating the task order as *first-class sequencing metadata*,
   - keeping the actual sequencing layer side-effect-free,
   - and allowing implementations to potentially run such a sequence in a
     more direct way than “go back to the loop and call the next function
     again” (sketched below).
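
As a rough illustration of the intended shape, here is a library-level
emulation in today's C++ (fetch_schedule and execute_scheduled are
placeholder names from my earlier post, not a worked-out design; a
core-language version would give implementations more latitude):

#include <span>
#include <vector>

using Task = void (*)();

std::vector<Task> pending;  // the recorded sequence: metadata only

// Sequencing layer: records which tasks run and in what order.
// No task executes here; this call only describes the schedule.
void fetch_schedule(std::span<const Task> tasks) {
    pending.assign(tasks.begin(), tasks.end());
}

// Execution layer: walks the metadata in order. Under the proposal, an
// implementation would be free to run the tasks back-to-back by some
// other strategy, as long as observable behaviour is preserved.
void execute_scheduled() {
    for (Task t : pending) {
        t();
    }
    pending.clear();
}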

4. Relation to your suggestions (attributes and chaining)

I agree that attributes like [[preserve_none]] and hand-written
task-chaining (where each task calls the next) are powerful techniques —
and, as you noted, they move into compiler-specific territory today.

Part of what I am trying to probe is whether some of these patterns,
especially for dynamic, event-driven task sequences, can be expressed in a
portable, abstract way in C++ so that:

   - one implementation uses plain calls,
   - another uses more aggressive control-flow chaining,
   - another might leverage a JIT or a different runtime strategy,

while keeping the same abstract semantics.
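
For reference, here is a sketch of the chaining pattern you describe, in
portable C++ (whether the tail calls become jumps is up to the
implementation, and attributes like [[preserve_none]] would be a
compiler-specific layer on top):

#include <cstdio>

struct TaskEntry;
using TaskFn = void (*)(const TaskEntry*);

struct TaskEntry { TaskFn fn; };

// Each task does its own work, then hands control directly to the next
// entry; a null fn terminates the chain, so the last task calls nothing.
void task_a(const TaskEntry* next) {
    std::puts("A");  // the task's real work goes here
    if (next->fn) next->fn(next + 1);
}

void task_b(const TaskEntry* next) {
    std::puts("B");
    if (next->fn) next->fn(next + 1);
}

int main() {
    TaskEntry chain[] = { {task_a}, {task_b}, {nullptr} };  // null entry ends the chain
    if (chain[0].fn) chain[0].fn(chain + 1);
}
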
5. Answering your final question

So, to answer your closing question:

"Does that describe your use-case, or come close to it?"

Yes, your Task / task_list / schedule_tasks scenario describes the use-case
quite well. My proposal is trying to see whether C++ can grow a way to:

   - separate the *description* of that dynamic schedule from its execution,
   - treat the schedule as first-class, side-effect-free sequencing metadata,
   - and give implementations room to optimize the execution of that schedule
     without forcing everything through a central dispatcher structure.

This is still early-stage and exploratory, and your restatement is very
helpful in reframing the explanation — thank you for that.





Best Regards,
Kamalesh Lakkampally,
Founder & CEO
www.chipnadi.com



On Thu, Dec 4, 2025 at 2:56 PM David Brown <david.brown_at_[hidden]> wrote:

>
> I am still struggling to understand what you are trying to do. From
> this (and your other post of around the same time), I have a better idea
> of what you are /not/ trying to do - this is nothing to do with cache
> management or similar.
>
> Your "code today" example tells us very little, and does not appear to
> do what you describe. It's just a task queue, and a loop that
> repeatedly takes the head of the queue and executes it. There is no
> scheduling as such. And your "future code" version just turns things
> into magic functions, and again tells us very little.
>
> I also note that your description in your other post seems somewhat
> contradictory - if the tasks are side-effect free, then the order of
> execution should not matter.
>
>
> Let me try to describe what I think you might be wanting, but do so in a
> somewhat different way, with slightly different names. If my guess here
> is right, then this could help you write your proposal and help others
> here understand what you are looking for. If I am wrong, then at least
> we can rule out something more that your proposal is not about!
>
>
> Let "Task" be a callable type. Perhaps it is a very simple type (a
> pointer to a void-to-void function), perhaps it is more complicated.
>
> Let "tast_list" be a container of Task's. It could be a C-style array,
> a vector, a priority queue, or whatever.
>
> You have a function "schedule_tasks(task_list)" which will re-order the
> Task's in task_list (or perhaps add or remove items), according to your
> dynamic scheduling needs.
>
> Then you want to run through all these Task's in order, using something
> like :
>
> void execute_task_list() {
> for (auto&& task : task_list) {
> task();
> }
> }
>
> Your problem here is that the body of this function, together with the
> function call overhead, takes significant time in comparison to the
> actual executions of the tasks. So you want to find some way to tell
> the compiler that these Task's on task_list should be executed
> back-to-back, without all the extra fuss of saving and restoring
> registers that don't need to be restored, and that sort of thing.
>
>
> I don't see how a compiler could generate code that handles this
> optimally when the task order changes dynamically. When the task order
> is fixed, you could generate your execute_task_list() function as :
>
> void execute_task_list() {
> T1();
> T2();
> T3();
> ...
> }
>
> The compiler can handle this as one giant inline function and optimise
> to its heart's content. It is a very long time since I have worked with
> HDL simulations with generated C code, and I didn't do much of it, but
> in my limited experience this kind of hard-coded scheduling is what was
> used. The challenges here were that you ended up with massive functions
> that pushed compilers and compile-times to extremes, but that's an
> implementation issue rather than a language issue.
>
> With dynamic ordering of the tasks, you can't get such code. You are
> never going to get the optimisation benefits a static list can give you.
>
> It is likely that you can use a fair bit of compiler-specific features
> to get faster code, however. If you keep your types as simple as
> possible (pointer to void-to-void functions in a C-style array) and use
> gcc's "preserve_none" attribute (for x86-64 and AArch64 at least) to
> minimise function call overhead. Conceivably attributes like this could
> be standardised in the language so that the code is not
> compiler-specific, but that would be a quite different proposal.
>
> You could also pass a pointer to the current position in the task list
> to each task, and have each task call the next one on the list. That
> might reduce overhead even more (these "calls" could likely be turned
> into jumps). Just make sure that the final task on the list doesn't try
> to call any further down the chain!
>
>
> Does that describe your use-case, or come close to it?
>
> David
>
>
>
>
> On 04/12/2025 06:22, Kamalesh Lakkampally via Std-Proposals wrote:
> > Hello everyone,
> >
> > All of the messages I sent were written by me personally. I did not
> > use an LLM to generate the content. I typed everything myself, in plain
> > English, doing my best to explain the concept in detail with headlines
> > and with paragraph style.
> >
> > Thank you for the detailed questions and discussions. Let me clarify the
> > concept more precisely in terms of the C++ abstract machine, because
> > several responses show that my original message was unclear.
> >
> >
> > *1. What are “fetch-only operations” conceptually?*
> >
> > They are *sequencing operations*, not computations.
> > A fetch-only operation does *not*:
> >
> > * load from memory
> > * store to memory
> > * execute user code
> > * allocate resources
> > * produce side effects
> >
> > Instead, it carries *a piece of metadata describing what should execute
> > next (i.e., where control goes)*, and the evaluation order is adjusted
> > accordingly.
> >
> >
> > *2. What problem does this solve?*
> >
> > In event-driven execution models (e.g., hardware simulators or
> > fine-grained task schedulers), programs often end up in patterns like:
> >
> > dispatcher();
> > → T1();
> > → dispatcher();
> > → T2();
> > → dispatcher();
> > → T3();
> > ...
> >
> > The /intent/ is simply:
> >
> > T1 → T2 → T3 → …
> >
> >
> > *but the actual control flow repeatedly returns through the dispatcher,
> > causing excessive sequencing overhead.* I would be very interested in
> > hearing what existing C++ techniques the committee considers suitable to
> > express this kind of dynamic sequencing without repeatedly returning
> > through a dispatcher.
> >
> >
> > *3. Before / After example (as requested)*
> >
> > *Today:*
> >
> > void scheduler() {
> >     while (!q.empty()) {
> >         auto t = q.pop_front();
> >         t(); // call and return repeatedly
> >     }
> > }
> >
> >
> > Abstract machine control-flow:
> >
> > scheduler → T1 → scheduler → T2 → scheduler → T3 → …
> >
> >
> > With fetch-only sequencing operations:
> >
> > fetch_schedule(q); // purely sequencing metadata
> >
> > execute_scheduled(); // walks metadata in-order
> >
> >
> > Abstract machine control-flow:
> >
> > T1 → T2 → T3 → …
> >
> > The program expresses the /intended evaluation sequence/ explicitly,
> > rather than bouncing through a central dispatcher.
> >
> >
> > *4. This is NOT hardware prefetch (clarifying misunderstanding)*
> >
> > There is *no*:
> >
> > * cache hinting
> > * memory prefetch
> > * cache-control
> > * MOVNTI/MOVNTDQA behavior
> >
> > The similarity to CPU PREFETCH was entirely unintentional and misleading
> > wording on my part.
> >
> >
> > *5. Why not rely on compiler as-if freedom?*
> >
> > An optimizer /could/ theoretically flatten the dispatcher pattern, but
> > in practice:
> >
> > * the task list is dynamic
> > * the order changes frequently
> > * the compiler cannot predict it
> >
> > This proposal explores whether C++ can provide a *mechanism to represent
> > dynamic sequencing explicitly*, separate from computation.
> >
> >
> > *6. About early-stage status*
> >
> > This is an early R0 exploratory draft.
> > I will refine the abstract-machine model and provide a clearer formalism
> > in subsequent revisions.
> > I appreciate everyone’s guidance in helping me align this more closely
> > with C++’s evaluation rules.
> >
> > Thank you again for the constructive feedback — I welcome further
> > questions and suggestions.
> >
> >
> >
> > Best Regards,
> > Kamalesh Lakkampally,
> > Founder & CEO
> > www.chipnadi.com
> >
> >
> >
> > On Wed, Dec 3, 2025 at 10:18 PM Sebastian Wittmeier via Std-Proposals
> > <std-proposals_at_[hidden]> wrote:
> >
> > So it is just hints for prefetching? Just for improved performance?
> >
> > I thought the general wisdom about prefetching instructions is that
> > their effect changes so much between processor architectures that it
> > makes little sense to put them into universal code, as in most cases
> > the hardware is already better at optimization than any optimizer, and
> > the programmer themselves can only beat the optimizer by hand-optimizing
> > those in pure inline assembly for one specific architecture?
> >
> > But probably there is more to it?
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden]
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >
> >
>