Date: Thu, 4 Dec 2025 06:19:31 +0000
Kamalesh,
It’s not clear that you’re looking to change anything at all in the language. If you are, you haven’t said exactly what it is.
It seems more that you have an unusual, highly parallel hardware architecture, and that what you’re looking for is a very different compiler implementation, not a different or modified language.
For example, in C++ or most any procedural language, you can write some sequence of independent steps:
a = b + c;
d = e * f;
g = 3 + 5 / h;
The compiler (compiler backend typically) is totally free to reorder these, or dispatch these, in whatever way the underlying hardware architecture allows.
If there are dependencies, say `k = a - 1;`, the compiler ensures the operations are ordered to satisfy that dependency, again in whatever way the hardware architecture allows.
So it seems already possible to express “micro-tasks”, whatever these might be, as simple independent C++ statements.
Assuming you have a novel computer hardware architecture, and you want compiler support for it, your time is very likely much better spent studying LLVM and how to port it to a new architecture, than trying to propose things to this group without understanding what you’re proposing.
You may also find other languages or language extensions (many supported by LLVM) designed for targeting highly parallel hardware such as GPUs, which may be a better fit for your hardware architecture.
At least you might come out of that exercise with a much better understanding of the relationship between your hardware and compilers.
(My 2 cents.)
Marc
From: Std-Proposals <std-proposals-bounces_at_[hidden]> On Behalf Of Kamalesh Lakkampally via Std-Proposals
Sent: Wednesday, December 3, 2025 21:46
To: std-proposals_at_[hidden]
Cc: Kamalesh Lakkampally <info_at_[hidden]>
Subject: Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics
Hello Thiago,
Thank you for the thoughtful question — it touches on the central issue of whether this idea belongs in the C++ Standard or should remain an architecture-specific extension.
Let me clarify the motivation more precisely, because the original message did not convey the full context.
1. This proposal is not about prefetching or cache-control
The draft text unfortunately used terminology that sounded hardware-adjacent, but the intent is not to introduce anything analogous to:
* prefetch intrinsics
* non-temporal loads/stores
* cache control hints
* pipeline fetch controls
Those are ISA-level optimizations and naturally belong in compiler extensions or architecture-specific intrinsics.
The concept here is completely different and exists at the semantic level, not the CPU-microarchitecture level.
2. The actual concept: explicit sequencing of dynamically changing micro-tasks
Many event-driven systems (HDL simulators are just one example) share a common execution model:
* Thousands of micro-tasks
* The order of tasks is computed dynamically
* The order changes every cycle or even more frequently
* Tasks themselves are small and often side-effect-free
* A dispatcher function repeatedly selects the next task to run
In C++ today, this typically results in a control-flow pattern like:
dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …
even when the intended evaluation sequence is conceptually:
T1 → T2 → T3 → T4 → …
The key issue is expressiveness: C++ currently has no mechanism to express “evaluate these things in this dynamic order, without re-entering a dispatcher between each step”.
Coroutines, tasks, executors, thread pools, and dispatch loops all still fundamentally operate through:
* repeated function calls, or
* repeated returns to a central controller, or
* runtime-managed schedulers
which means that as programs scale down to extremely fine-grained tasks, sequencing overhead becomes dominant, even on architectures where prefetching is not a concern.
3. Why this is not architecture-specific
The misunderstanding arises because “fetch-only operation” sounds like a CPU fetch-stage mechanism.
The actual idea is:
A mechanism in the abstract machine that allows a program to express a sequence of evaluations that does not pass through a central dispatcher after each step.
This can be implemented portably in many ways:
* Using a software-run interpreter that consumes the sequencing structure
* Using an implementation-specific optimization when available
* Mapping to architecture-specific controls on processors that support such mechanisms
* Or completely lowering into normal calls on simpler hardware
This is similar in spirit to:
* coroutines: abstract machine semantics, many possible implementations
* atomics: abstract semantics, architecture maps them to whatever operations it has
* SIMD types: portable semantics, mapped to different instructions per architecture
* executors: abstract relationships, many possible backends
So the goal is not to standardize “fetch-stage hardware instructions”, but to explore whether C++ can expose a portable semantic form of sequencing that compilers may optimize very differently depending on platform capabilities.
4. Why prefetch intrinsics are insufficient
Prefetch intrinsics provide:
* data locality hints
* non-temporal load/store hints
* per-architecture micro-optimizations
They do not provide:
* a semantic representation of evaluation order
* a way to represent a dynamically computed schedule
* a portable abstraction
* a way to eliminate dispatcher re-entry in the program model
Prefetching does not remove the repeated calls/returns between tasks.
This proposal focuses on expressing intent, not on cache behavior.
5. Why this might belong in the Standard
Because the idea is:
* semantic, not architectural
* portable across CPU, GPU, and FPGA-style systems
* potentially optimizable by compilers
* relevant to domains beyond HDL tools
* conceptually related to execution agents, coroutines, and sequencing in executors
* about exposing user intent (dynamic ordering), not hardware control
Many modern workloads — including simulation, actor frameworks, reactive graph engines, and fine-grained schedulers — could benefit from a portable way to express:
“Here is the next unit of evaluation; no need to return to a dispatcher.”
even if the implementation varies drastically per target platform.
6. Early-stage nature
This is still an early R0 exploratory draft.
I fully expect that the idea will require:
* reframing in abstract-machine terminology
* better examples
* clarification of how sequencing is expressed
* exploration of implementability on real compilers
I appreciate your question because it helps anchor the discussion in the right conceptual layer.
Thank you again for engaging — your perspective is extremely valuable.
On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <std-proposals_at_[hidden]> wrote:
On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh Lakkampally via Std-Proposals wrote:
> The goal is to support workloads where the execution order of micro-tasks
> changes dynamically and unpredictably every cycle, such as *event-driven
> HDL/SystemVerilog simulation*.
> In such environments, conventional C++ mechanisms (threads, coroutines,
> futures, indirect calls, executors) incur significant pipeline redirection
> penalties. Fetch-only instructions aim to address this problem in a
> structured, language-visible way.
>
> I would greatly appreciate *feedback, criticism, and suggestions* from the
> community.
> I am also *open to collaboration.*
Can you explain why this should be in the Standard? Why are the prefetch
intrinsics available as compiler extensions for a lot of architectures not
enough? My first reaction is that this type of design is going to be very
architecture-specific by definition, so using architecture-specific extensions
should not be an impediment.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Data Center - Platform & Sys. Eng.
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-12-04 06:19:40
