Date: Thu, 4 Dec 2025 11:46:03 +0100
As it can alternatively be implemented with a priority queue and function calls (a minimal sketch follows below), your proposal still serves two purposes:
- better description of intent in the source code
- possibly huge performance gains on special hardware
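For reference, here is a minimal sketch of the status-quo alternative I mean, a plain priority queue of callables. The names and types are my own illustration, not from your proposal:

#include <functional>
#include <queue>
#include <vector>

struct Scheduled {
    int priority;                 // lower value = dispatched earlier
    std::function<void()> task;
};

struct LaterFirstOut {
    bool operator()(const Scheduled& a, const Scheduled& b) const {
        return a.priority > b.priority;   // makes the queue a min-heap
    }
};

using Queue = std::priority_queue<Scheduled, std::vector<Scheduled>, LaterFirstOut>;

void run_all(Queue& q) {
    while (!q.empty()) {
        q.top().task();   // dispatch the next task...
        q.pop();          // ...then return to the dispatcher loop
    }
}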
Would the dispatch order be specified down to a unique total order, or just with constraints? E.g. these 5 before those 7, but any order within the 5 or within the 7?
Would the dispatched 'functions' have an effect on each other, e.g. through global variables, shared objects, or shared local variables? Otherwise their order of execution would not matter (with some exceptions like I/O).
If the order is dynamic, it has to be stored somehow. You are talking about metadata.
Can the order be determined at compile time? C++ now has very powerful compile-time facilities, where very complex functions written like ordinary runtime C++ can be evaluated. Or must the order depend on runtime input?
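As a hedged illustration of those facilities: if the order is known at compile time, a constexpr function can compute the schedule and the dispatch collapses to a fixed sequence (my own sketch, not from the proposal):

#include <array>
#include <cstddef>

void T1() { /* snippet 1 */ }
void T2() { /* snippet 2 */ }
void T3() { /* snippet 3 */ }

// A schedule computed at compile time; the computation may be arbitrarily
// complex ordinary C++, as long as it is constexpr-evaluable.
constexpr std::array<std::size_t, 3> make_schedule() {
    return {2, 0, 1};   // e.g. run T3, then T1, then T2
}

void execute_static_schedule() {
    constexpr auto order = make_schedule();
    constexpr std::array<void (*)(), 3> tasks{&T1, &T2, &T3};
    for (std::size_t i : order)
        tasks[i]();   // a fixed sequence the compiler can fully unroll
}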
What would example hardware look like that can dispatch snippets of code dynamically and quickly? Daniel brought up the idea of jumping to the next snippet instead of returning, but the addresses still have to be read.
You could also bring up totally different hardware, like simulating on an FPGA itself.
It would help to know what the code is compiled to, in order to judge whether and how performance would be gained. Otherwise we are removing one of the two goals (performance) stated at the beginning of this message.
How many machine instructions would those snippets typically have in your implementation? Or how many cycles should they typically run? 2? 10? 50? 2000?
-----Original Message-----
From: Kamalesh Lakkampally via Std-Proposals <std-proposals_at_[hidden]>
Sent: Thu 04.12.2025 11:14
Subject: Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics
To: David Brown <david.brown_at_[hidden]>;
CC: Kamalesh Lakkampally <founder_at_[hidden]>; std-proposals_at_[hidden];
Hello David,
Thank you for taking the time to restate the scenario in your own terms — that actually comes very close to what I have in mind, and it helps me see where my earlier explanations were unclear.
1. Your restatement of the use-case
Your description with:
* a Task type,
* a task_list,
* a schedule_tasks(task_list) function that reorders / updates it, and
* an execute_task_list() loop that walks the list and calls each task
is indeed very close to the core use-case.
Conceptually, that’s exactly the pattern I’m thinking about:
void execute_task_list() {
    for (auto&& task : task_list) {
        task(); // T1, then T2, then T3, in some dynamic order
    }
}
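(For concreteness, a minimal sketch of the declarations this loop assumes; the Task alias and the container choice are my own illustration, not part of the proposal:)

#include <vector>

using Task = void (*)();        // simplest possible callable task type
std::vector<Task> task_list;    // reordered by schedule_tasks() between runs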
The key observation (which you also pointed out) is that when task_list is dynamic and updated frequently, the overhead of:
* the loop body,
* repeated calls,
* and the dispatcher/scheduler structure around it
can become significant relative to the cost of the tasks themselves.
2. Clarifying the “side-effect free” comment
You are right to call out the contradiction in my earlier wording.
I did not mean that all tasks are strictly side-effect-free in the formal C++ sense. What I meant is that the sequencing mechanism itself (the “fetch-only” layer) should not have side effects — it only describes the order in which tasks are evaluated.
The tasks themselves may certainly have side effects; in many real systems they do. The order therefore does matter, but the role of the sequencing layer is to:
* describe "T1, then T2, then T3, in this dynamically determined order",
* without the sequencing operations themselves performing observable work.
That was an imprecise explanation on my part, and your comment helped me see that.
3. What I am trying to explore beyond the current model
Your example of generating:
void execute_task_list() {
    T1();
    T2();
    T3();
    ...
}
for a fixed schedule is exactly the kind of thing that compilers can optimize very well today for static ordering. I fully agree that with dynamic ordering, we cannot expect the same kind of aggressive static inlining / unrolling / specialization.
I am not expecting the language to magically recover “perfect” static optimization for dynamic schedules.
What I am exploring is whether C++ can:
* make the sequencing intent explicit and separate from the dispatcher loop, so that
* implementations have more freedom to choose different strategies for executing that sequence of tasks back-to-back, without the programmer having to encode that as a hand-written dispatcher.
In your terms, today we express sequencing as:
for (auto&& task : task_list) {
    task();
}
but this is semantically a loop + function call pattern. In practice the compiler preserves that structure: because the task order is only known at run time, it has no realistic way to replace the pattern with a different internal execution strategy that keeps the same observable semantics while removing the dispatcher overhead.
The “fetch-only” idea (which I clearly need to rename and formalize much better) is about:
* treating the task order as first-class sequencing metadata,
* keeping the actual sequencing layer side-effect-free,
* and allowing implementations to potentially run such a sequence in a more direct way than "go back to the loop and call the next function again".
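To make that concrete, here is one possible shape such an interface could take. This is purely an illustrative sketch: the fetch_schedule / execute_scheduled names come from my earlier post, the fallback bodies shown can of course only loop and call, and nothing about the real semantics is settled. The point is that an implementation would be free to lower execute_scheduled() more directly:

#include <span>
#include <vector>

using Task = void (*)();

namespace detail { inline std::vector<Task> scheduled; }

// Hypothetical: record the intended order as pure sequencing metadata.
// This call itself performs no observable work on the tasks.
inline void fetch_schedule(std::span<const Task> order) {
    detail::scheduled.assign(order.begin(), order.end());
}

// Hypothetical: run the recorded sequence back-to-back. A library fallback
// can only loop and call; the idea is that an implementation could lower
// this more directly (e.g. chaining the tasks without returning here).
inline void execute_scheduled() {
    for (Task t : detail::scheduled)
        t();
}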
4. Relation to your suggestions (attributes and chaining)
I agree that attributes like [[preserve_none]] and hand-written task-chaining (where each task calls the next) are powerful techniques — and, as you noted, they move into compiler-specific territory today.
Part of what I am trying to probe is whether:
* some of these patterns,
* especially for dynamic, event-driven task sequences,
can be expressed in a portable, abstract way in C++ so that:
* one implementation uses plain calls,
* another uses more aggressive control-flow chaining,
* another might leverage a JIT or a different runtime strategy,
while keeping the same abstract semantics.
5. Answering your final question
So, to answer your closing question:
Does that describe your use-case, or come close to it?
Yes, your Task / task_list / schedule_tasks scenario describes the use-case quite well. My proposal is trying to see whether C++ can grow a way to:
* separate the description of that dynamic schedule from its execution,
* treat the schedule as first-class, side-effect-free sequencing metadata,
* and give implementations room to optimize the execution of that schedule without forcing everything through a central dispatcher structure.
This is still early-stage and exploratory, and your restatement is very helpful in reframing the explanation — thank you for that.
Best Regards,
Kamalesh Lakkampally,
Founder & CEO
www.chipnadi.com
On Thu, Dec 4, 2025 at 2:56 PM David Brown <david.brown_at_[hidden]> wrote:
I am still struggling to understand what you are trying to do. From
this (and your other post of around the same time), I have a better idea
of what you are /not/ trying to do - this is nothing to do with cache
management or similar.
Your "code today" example tells us very little, and does not appear to
do what you describe. It's just a task queue, and a loop that
repeatedly takes the head of the queue and executes it. There is no
scheduling as such. And your "future code" version just turns things
into magic functions, and again tells us very little.
I also note that your description in your other post seems somewhat
contradictory - if the tasks are side-effect free, then the order of
execution should not matter.
Let me try to describe what I think you might be wanting, but do so in a
somewhat different way, with slightly different names. If my guess here
is right, then this could help you write your proposal and help others
here understand what you are looking for. If I am wrong, then at least
we can rule out something more that your proposal is not about!
Let "Task" be a callable type. Perhaps it is a very simple type (a
pointer to a void-to-void function), perhaps it is more complicated.
Let "tast_list" be a container of Task's. It could be a C-style array,
a vector, a priority queue, or whatever.
You have a function "schedule_tasks(task_list)" which will re-order the
Tasks in task_list (or perhaps add or remove items), according to your
dynamic scheduling needs.
Then you want to run through all these Tasks in order, using something
like :
void execute_task_list() {
    for (auto&& task : task_list) {
        task();
    }
}
Your problem here is that the body of this function, together with the
function call overhead, takes significant time in comparison to the
actual executions of the tasks. So you want to find some way to tell
the compiler that these Tasks on task_list should be executed
back-to-back, without all the extra fuss of saving and restoring
registers that don't need to be restored, and that sort of thing.
I don't see how a compiler could generate code that handles this
optimally when the task order changes dynamically. When the task order
is fixed, you could generate your execute_task_list() function as :
void execute_task_list() {
    T1();
    T2();
    T3();
    ...
}
The compiler can handle this as one giant inline function and optimise
to its heart's content. It is a very long time since I have worked with
HDL simulations with generated C code, and I didn't do much of it, but
in my limited experience this kind of hard-coded scheduling is what was
used. The challenges here were that you ended up with massive functions
that pushed compilers and compile-times to extremes, but that's an
implementation issue rather than a language issue.
With dynamic ordering of the tasks, you can't get such code. You are
never going to get the optimisation benefits a static list can give you.
It is likely that you can use a fair number of compiler-specific features
to get faster code, however. Keep your types as simple as possible
(pointers to void-to-void functions in a C-style array) and use
gcc's "preserve_none" attribute (for x86-64 and AArch64 at least) to
minimise function call overhead. Conceivably attributes like this could
be standardised in the language so that the code is not
compiler-specific, but that would be a quite different proposal.
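As a sketch of what that could look like (compiler-specific; I guard the
attribute with __has_attribute since support for "preserve_none" varies by
compiler, version, and target, and the exact attribute placement may need
adjusting for a given compiler):

// Guard the attribute so the code still compiles where it is unsupported.
#if defined(__has_attribute)
#  if __has_attribute(preserve_none)
#    define TASK_CC __attribute__((preserve_none))
#  endif
#endif
#ifndef TASK_CC
#  define TASK_CC
#endif

#include <cstddef>

typedef void TASK_CC (*Task)(void);   // pointer to void-to-void task

TASK_CC void t1(void) { /* tiny snippet of work */ }
TASK_CC void t2(void) { /* tiny snippet of work */ }

Task task_list[] = { t1, t2 };

void execute_task_list(void) {
    const std::size_t n = sizeof task_list / sizeof task_list[0];
    for (std::size_t i = 0; i != n; ++i)
        task_list[i]();   // minimal register save/restore around each call
}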
You could also pass a pointer to the current position in the task list
to each task, and have each task call the next one on the list. That
might reduce overhead even more (these "calls" could likely be turned
into jumps). Just make sure that the final task on the list doesn't try
to call any further down the chain!
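A rough sketch of that chaining pattern, with my own illustrative types; a
null function pointer serves as the sentinel that ends the chain:

struct TaskEntry;
using ChainedTask = void (*)(const TaskEntry* next);
struct TaskEntry { ChainedTask fn; };

// Each task finishes by invoking its successor; the optimiser can often
// turn this tail call into a plain jump.
inline void run_next(const TaskEntry* next) {
    if (next->fn)
        next->fn(next + 1);
}

void t1(const TaskEntry* next) { /* work */ run_next(next); }
void t2(const TaskEntry* next) { /* work */ run_next(next); }

// A null fn acts as the sentinel, so the last real task stops the chain.
TaskEntry task_list[] = { {t1}, {t2}, {nullptr} };

void execute_task_list() {
    if (task_list[0].fn)
        task_list[0].fn(&task_list[1]);
}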
Does that describe your use-case, or come close to it?
David
On 04/12/2025 06:22, Kamalesh Lakkampally via Std-Proposals wrote:
> Hello everyone,
>
> All of the messages I sent were written by me personally. I did not
> use an LLM to generate the content. I typed everything myself, in plain
> English, doing my best to explain the concept in detail with headlines
> and with paragraph style.
>
> Thank you for the detailed questions and discussions. Let me clarify the
> concept more precisely in terms of the C++ abstract machine, because
> several responses show that my original message was unclear.
>
>
> *1. What are “fetch-only operations” conceptually?*
>
> They are *sequencing operations*, not computations.
> A fetch-only operation does *not*:
>
> * load from memory
> * store to memory
> * execute user code
> * allocate resources
> * produce side effects
>
> Instead, it carries *a piece of metadata describing what should execute
> next (i.e., where control goes)*, and the evaluation order is adjusted
> accordingly.
>
>
> *2. What problem does this solve?*
>
> In event-driven execution models (e.g., hardware simulators or
> fine-grained task schedulers), programs often end up in patterns like:
>
> dispatcher();
> → T1();
> → dispatcher();
> → T2();
> → dispatcher();
> → T3();
> ...
>
> The /intent/ is simply:
>
> T1 → T2 → T3 → …
>
>
> *but the actual control flow repeatedly returns through the dispatcher,
> causing excessive sequencing overhead.* I would be very
> interested in hearing what existing C++ techniques the committee
> considers suitable to express this kind of dynamic sequencing without
> repeatedly returning through a dispatcher.
>
>
> *3. Before / After example (as requested)*
>
> *Today:*
>
> void scheduler() {
>     while (!q.empty()) {
>         auto t = q.pop_front();
>         t(); // call and return repeatedly
>     }
> }
>
>
> Abstract machine control-flow:
>
> scheduler → T1 → scheduler → T2 → scheduler → T3 → …
>
>
> With fetch-only sequencing operations:
>
> fetch_schedule(q); // purely sequencing metadata
>
> execute_scheduled(); // walks metadata in-order
>
>
> Abstract machine control-flow:
>
> T1 → T2 → T3 → …
>
> The program expresses the /intended evaluation sequence/ explicitly,
> rather than bouncing through a central dispatcher.
>
>
> *4. This is NOT hardware prefetch (clarifying misunderstanding)*
>
> There is *no*:
>
> * cache hinting
> * memory prefetch
> * cache-control
> * MOVNTI/MOVNTDQA behavior
>
> The similarity to CPU PREFETCH was entirely unintentional and misleading
> wording on my part.
>
>
> *5. Why not rely on compiler as-if freedom?*
>
> An optimizer /could/ theoretically flatten the dispatcher pattern, but
> in practice:
>
> * the task list is dynamic
> * the order changes frequently
> * the compiler cannot predict it
>
> This proposal explores whether C++ can provide a *mechanism to represent
> dynamic sequencing explicitly*, separate from computation.
>
>
> *6. About early-stage status*
>
> This is an early R0 exploratory draft.
> I will refine the abstract-machine model and provide a clearer formalism
> in subsequent revisions.
> I appreciate everyone’s guidance in helping me align this more closely
> with C++’s evaluation rules.
>
> Thank you again for the constructive feedback — I welcome further
> questions and suggestions.
>
>
>
> *Best Regards,
> Kamalesh Lakkampally,*
> *Founder & CEO*
> www.chipnadi.com
>
>
>
> On Wed, Dec 3, 2025 at 10:18 PM Sebastian Wittmeier via Std-Proposals
> <std-proposals_at_[hidden]>
> wrote:
>
>
> So it is just hints for prefetching? Just for improved performance?
>
> I thought the general wisdom about prefetching instructions is that
> their effect changes so much between processor architectures that it
> does not make much sense to put them into universal code, as in most
> cases the hardware is already better at optimization than any
> optimizer, and the programmer can only beat the optimizer by
> hand-optimizing in pure inline assembly for one specific
> architecture?
>
> But probably there is more to it?
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
>
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-12-04 11:00:36
