Date: Thu, 4 Dec 2025 10:26:35 +0100
I am still struggling to understand what you are trying to do. From
this (and your other post of around the same time), I have a better idea
of what you are /not/ trying to do - this is nothing to do with cache
management or similar.
Your "code today" example tells us very little, and does not appear to
do what you describe. It's just a task queue, and a loop that
repeatedly takes the head of the queue and executes it. There is no
scheduling as such. And your "future code" version just turns things
into magic functions, and again tells us very little.
I also note that your description in your other post seems somewhat
contradictory - if the tasks are side-effect free, then the order of
execution should not matter.
Let me try to describe what I think you might be wanting, but do so in a
somewhat different way, with slightly different names. If my guess here
is right, then this could help you write your proposal and help others
here understand what you are looking for. If I am wrong, then at least
we can rule out one more thing that your proposal is not about!
Let "Task" be a callable type. Perhaps it is a very simple type (a
pointer to a void-to-void function), perhaps it is more complicated.
Let "task_list" be a container of Tasks. It could be a C-style array,
a vector, a priority queue, or whatever.
You have a function "schedule_tasks(task_list)" which will re-order the
Tasks in task_list (or perhaps add or remove items), according to your
dynamic scheduling needs.
Then you want to run through all these Tasks in order, using something
like:
void execute_task_list() {
    for (auto&& task : task_list) {
        task();
    }
}
Your problem here is that the body of this function, together with the
function call overhead, takes significant time in comparison to the
actual executions of the tasks. So you want to find some way to tell
the compiler that these Tasks on task_list should be executed
back-to-back, without all the extra fuss of saving and restoring
registers that don't need to be restored, and that sort of thing.
I don't see how a compiler could generate code that handles this
optimally when the task order changes dynamically. When the task order
is fixed, you could generate your execute_task_list() function as :
void execute_task_list() {
    T1();
    T2();
    T3();
    ...
}
The compiler can handle this as one giant inline function and optimise
to its heart's content. It is a very long time since I have worked with
HDL simulations with generated C code, and I didn't do much of it, but
in my limited experience this kind of hard-coded scheduling is what was
used. The challenges here were that you ended up with massive functions
that pushed compilers and compile-times to extremes, but that's an
implementation issue rather than a language issue.
With dynamic ordering of the tasks, you can't get such code. You are
never going to get the optimisation benefits a static list can give you.
It is likely, however, that you can use a fair number of compiler-specific
features to get faster code. For example, keep your types as simple as
possible (pointers to void-to-void functions in a C-style array) and use
gcc's "preserve_none" attribute (for x86-64 and AArch64 at least) to
minimise function call overhead. Conceivably attributes like this could
be standardised in the language so that the code is not
compiler-specific, but that would be a quite different proposal.
You could also pass a pointer to the current position in the task list
to each task, and have each task call the next one on the list. That
might reduce overhead even more (these "calls" could likely be turned
into jumps). Just make sure that the final task on the list doesn't try
to call any further down the chain!
Does that describe your use-case, or come close to it?
David
On 04/12/2025 06:22, Kamalesh Lakkampally via Std-Proposals wrote:
> Hello everyone,
>
> All of the messages I sent were written by me personally. I did not
> use an LLM to generate the content. I typed everything myself, in plain
> English, doing my best to explain the concept in detail with headlines
> and with paragraph style.
>
> Thank you for the detailed questions and discussions. Let me clarify the
> concept more precisely in terms of the C++ abstract machine, because
> several responses show that my original message was unclear.
>
>
> *1. What are “fetch-only operations” conceptually?*
>
> They are *sequencing operations*, not computations.
> A fetch-only operation does *not*:
>
> * load from memory
> * store to memory
> * execute user code
> * allocate resources
> * produce side effects
>
> Instead, it carries *a piece of metadata describing what should execute
> next (i.e. where control goes)*, and the evaluation order is adjusted
> accordingly.
>
>
> *2. What problem does this solve?*
>
> In event-driven execution models (e.g., hardware simulators or
> fine-grained task schedulers), programs often end up in patterns like:
>
> dispatcher();
> → T1();
> → dispatcher();
> → T2();
> → dispatcher();
> → T3();
> ...
>
> The /intent/ is simply:
>
> T1 → T2 → T3 → …
>
>
> *but the actual control flow repeatedly returns through the dispatcher,
> causing excessive sequencing overhead.* I would be very
> interested in hearing what existing C++ techniques the committee
> considers suitable to express this kind of dynamic sequencing without
> repeatedly returning through a dispatcher.
>
>
> *3. Before / After example (as requested)*
>
> *Today:*
>
> void scheduler() {
>     while (!q.empty()) {
>         auto t = q.pop_front();
>         t();   // call and return repeatedly
>     }
> }
>
>
> Abstract machine control-flow:
>
> scheduler → T1 → scheduler → T2 → scheduler → T3 → …
>
>
> With fetch-only sequencing operations:
>
> fetch_schedule(q); // purely sequencing metadata
>
> execute_scheduled(); // walks metadata in-order
>
>
> Abstract machine control-flow:
>
> T1 → T2 → T3 → …
>
> The program expresses the /intended evaluation sequence/ explicitly,
> rather than bouncing through a central dispatcher.
>
>
> *4. This is NOT hardware prefetch (clarifying misunderstanding)*
>
> There is *no*:
>
> * cache hinting
> * memory prefetch
> * cache-control
> * MOVNTI/MOVNTDQA behavior
>
> The similarity to CPU PREFETCH was entirely unintentional and misleading
> wording on my part.
>
>
> *5. Why not rely on compiler as-if freedom?*
>
> An optimizer /could/ theoretically flatten the dispatcher pattern, but
> in practice:
>
> * the task list is dynamic
> * the order changes frequently
> * the compiler cannot predict it
>
> This proposal explores whether C++ can provide a *mechanism to represent
> dynamic sequencing explicitly*, separate from computation.
>
>
> *6. About early-stage status*
>
> This is an early R0 exploratory draft.
> I will refine the abstract-machine model and provide a clearer formalism
> in subsequent revisions.
> I appreciate everyone’s guidance in helping me align this more closely
> with C++’s evaluation rules.
>
> Thank you again for the constructive feedback — I welcome further
> questions and suggestions.
>
>
>
> *Best Regards,
> Kamalesh Lakkampally,*
> *Founder & CEO*
> www.chipnadi.com <http://www.chipnadi.com>
>
>
>
> On Wed, Dec 3, 2025 at 10:18 PM Sebastian Wittmeier via Std-Proposals
> <std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>>
> wrote:
>
>
> So it is just hints for prefetching? Just for improved performance?
>
> I thought the general wisdom about prefetching instructions is that
> their effect changes so much between processor architectures that it
> makes little sense to put them into universal code, as in most
> cases the hardware is already better at optimization than any
> optimizer, and the programmer can only beat the optimizer
> by hand-optimizing in pure inline assembly for one specific
> architecture?
>
> But probably there is more to it?
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden] <mailto:Std-Proposals_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> <https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
>
>
Received on 2025-12-04 09:26:44
