Kamalesh,

It’s not clear that you’re looking to change anything at all in the language. If you are, you haven’t said exactly what it is.

It seems more that you have an unusual, highly parallel hardware architecture, and that what you’re looking for is a very different compiler implementation, not a different or modified language.

For example, in C++ or most any procedural language, you can write some sequence of independent steps:

    a = b + c;
    d = e * f;
    g = 3 + 5 / h;

The compiler (typically its backend) is free to reorder or dispatch these in whatever way the underlying hardware architecture allows, provided the program’s observable behavior is unchanged.

If there are dependencies, such as, say, `k = a - 1;`, the compiler ensures the operations are ordered to satisfy the dependency, in whatever way the hardware architecture allows.

So it already seems possible to express “micro-tasks”, whatever these might be, as simple independent C++ statements.
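As a minimal sketch, the statements above can be wrapped in a small function. The names `Results` and `independent_steps` are made up for illustration and do not come from the proposal. The first three assignments have no data dependencies on one another, so the compiler may reorder or overlap them, while the last one depends on `a`.

```cpp
#include <cassert>

// Hypothetical illustration only: these names are not from the proposal.
struct Results {
    int a;
    int d;
    int g;
    int k;
};

Results independent_steps(int b, int c, int e, int f, int h) {
    assert(h != 0);  // precondition: the division below needs a nonzero h

    Results r{};
    // No data dependencies among these three statements, so the compiler
    // may reorder or overlap them as long as the results are the same.
    r.a = b + c;
    r.d = e * f;
    r.g = 3 + 5 / h;  // integer division

    // Depends on r.a, so it must observe the result of b + c.
    r.k = r.a - 1;
    return r;
}
```

Calling `independent_steps(1, 2, 3, 4, 5)` yields `a == 3`, `d == 12`, `g == 4`, and `k == 2`, whatever order the compiler actually chose for the first three statements.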

Assuming you have a novel computer hardware architecture, and you want compiler support for it, your time is very likely much better spent studying LLVM and how to port it to a new architecture, than trying to propose things to this group without understanding what you’re proposing.

You may also find different languages or language extensions (many supported by LLVM) that help target highly parallel hardware such as GPUs, and that perhaps better fit your hardware architecture.

At least you might come out of that exercise with a much better understanding of the relationship between your hardware and compilers.

(My 2 cents.)

Marc

From: Std-Proposals <std-proposals-bounces@lists.isocpp.org> On Behalf Of Kamalesh Lakkampally via Std-Proposals
Sent: Wednesday, December 3, 2025 21:46
To: std-proposals@lists.isocpp.org
Cc: Kamalesh Lakkampally <info@chipnadi.com>
Subject: Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

Hello Thiago,

Thank you for the thoughtful question — it touches on the central issue of whether this idea belongs in the C++ Standard or should remain an architecture-specific extension.
Let me clarify the motivation more precisely, because the original message did not convey the full context.

1. This proposal is not about prefetching or cache-control

The draft text unfortunately used terminology that sounded hardware-adjacent, but the intent is not to introduce anything analogous to:

- prefetch intrinsics
- non-temporal loads/stores
- cache control hints
- pipeline fetch controls

Those are ISA-level optimizations and naturally belong in compiler extensions or architecture-specific intrinsics.

The concept here is completely different and exists at the semantic level, not the CPU-microarchitecture level.

2. The actual concept: explicit sequencing of dynamically changing micro-tasks

Many event-driven systems (HDL simulators are just one example) share a common execution model:

- Thousands of micro-tasks
- The order of tasks is computed dynamically
- The order changes every cycle or even more frequently
- Tasks themselves are small and often side-effect-free
- A dispatcher function repeatedly selects the next task to run

In C++ today, this typically results in a control-flow pattern like:

dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …

even when the intended evaluation sequence is conceptually:

T1 → T2 → T3 → T4 → …

The key issue is expressiveness: C++ currently has no mechanism to express “evaluate these things in this dynamic order, without re-entering a dispatcher between each step”.
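For concreteness, the dispatcher pattern described above might look roughly like this in today's C++. This is only a sketch under assumptions: `Task`, `task1` through `task3`, and `dispatch_all` are hypothetical names, and the ready queue stands in for whatever a real simulator would compute from pending events.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical task signature: each micro-task records that it ran.
using Task = void (*)(std::vector<int>& trace);

void task1(std::vector<int>& trace) { trace.push_back(1); }
void task2(std::vector<int>& trace) { trace.push_back(2); }
void task3(std::vector<int>& trace) { trace.push_back(3); }

// Central dispatcher: takes the next ready task, calls it, and regains
// control when the task returns.
std::vector<int> dispatch_all(const std::vector<Task>& tasks,
                              std::queue<std::size_t> ready) {
    std::vector<int> trace;
    while (!ready.empty()) {
        std::size_t next = ready.front();  // dispatcher decides what runs next
        ready.pop();
        assert(next < tasks.size());
        tasks[next](trace);                // indirect call into the task
        // Control is back in the dispatcher before the next task can start.
    }
    return trace;
}
```

With tasks `{task1, task2, task3}` and a ready queue holding the indices 2, 0, 1, the trace comes out as 3, 1, 2. The tasks run in the dynamic order, but every step passes through the loop in `dispatch_all`.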

Coroutines, tasks, executors, thread pools, and dispatch loops all still fundamentally operate through:

- repeated function calls, or
- repeated returns to a central controller, or
- runtime-managed schedulers

which means that as programs scale down to extremely fine-grained tasks, sequencing overhead becomes dominant, even on architectures where prefetching is not a concern.

3. Why this is not architecture-specific

The misunderstanding arises because “fetch-only operation” sounds like a CPU fetch-stage mechanism.

The actual idea is:

A mechanism in the abstract machine that allows a program to express a sequence of evaluations that does not pass through a central dispatcher after each step.

This can be implemented portably in many ways:

- Using a software-run interpreter that consumes the sequencing structure
- Using an implementation-specific optimization when available
- Mapping to architecture-specific controls on processors that support such mechanisms
- Or completely lowering into normal calls on simpler hardware
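One possible reading of the last fallback, sketched in today's C++: the schedule is an ordinary run-time data structure, and each step hands control to its successor with a normal call instead of returning to a central loop. All names here (`Schedule`, `Step`, `run_from`, `step_one`, and so on) are hypothetical and are not part of the draft. This is not the proposed mechanism itself, only an approximation of its shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Schedule;

// Hypothetical step signature: a step knows its position in the schedule.
using Step = void (*)(const Schedule& schedule, std::size_t pos,
                      std::vector<int>& trace);

struct Schedule {
    std::vector<Step> steps;  // evaluation order, computed at run time
};

// Transfers control to the step at `pos`, if there is one.
void run_from(const Schedule& schedule, std::size_t pos,
              std::vector<int>& trace) {
    if (pos < schedule.steps.size()) {
        assert(schedule.steps[pos] != nullptr);
        schedule.steps[pos](schedule, pos, trace);
    }
}

// Each step does its work, then invokes its successor itself.
void step_one(const Schedule& schedule, std::size_t pos, std::vector<int>& trace) {
    trace.push_back(1);
    run_from(schedule, pos + 1, trace);
}

void step_two(const Schedule& schedule, std::size_t pos, std::vector<int>& trace) {
    trace.push_back(2);
    run_from(schedule, pos + 1, trace);
}

void step_three(const Schedule& schedule, std::size_t pos, std::vector<int>& trace) {
    trace.push_back(3);
    run_from(schedule, pos + 1, trace);
}

std::vector<int> run_chained(const Schedule& schedule) {
    std::vector<int> trace;
    run_from(schedule, 0, trace);
    return trace;
}
```

A schedule holding `step_two`, `step_one`, `step_three` produces the trace 2, 1, 3. Note the cost of this lowering: C++ does not guarantee tail-call elimination, so the call stack may grow with the length of the schedule unless the compiler happens to optimize the chained calls.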

This is similar in spirit to:

- coroutines: abstract machine semantics, many possible implementations
- atomics: abstract semantics, architecture maps them to whatever operations it has
- SIMD types: portable semantics, mapped to different instructions per architecture
- executors: abstract relationships, many possible backends

So the goal is not to standardize “fetch-stage hardware instructions”,
but to explore whether C++ can expose a portable semantic form of sequencing that compilers may optimize very differently depending on platform capabilities.

4. Why prefetch intrinsics are insufficient

Prefetch intrinsics provide:

- data locality hints
- non-temporal load/store hints
- per-architecture micro-optimizations

They do not provide:

- a semantic representation of evaluation order
- a way to represent a dynamically computed schedule
- a portable abstraction
- a way to eliminate dispatcher re-entry in the program model

Prefetching does not remove the repeated calls/returns between tasks.
This proposal focuses on expressing intent, not on cache behavior.
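A small sketch of that point, assuming GCC or Clang, where `__builtin_prefetch` is available as a compiler extension. The `MicroTask` record and the function names are hypothetical. The prefetch only hints that the next record will be needed soon; the loop still performs one indirect call and one return per task.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task record for illustration.
struct MicroTask {
    int (*fn)(int);  // the work to perform
    int input;       // data the task reads
};

int add_one(int x) { return x + 1; }
int twice(int x) { return 2 * x; }

int run_with_prefetch(const std::vector<MicroTask>& tasks) {
    int sum = 0;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 1 < tasks.size()) {
            // Hint only: asks the hardware to bring the next record into cache.
            __builtin_prefetch(&tasks[i + 1]);
        }
#endif
        assert(tasks[i].fn != nullptr);
        // The control flow is unchanged: call the task, return to the loop.
        sum += tasks[i].fn(tasks[i].input);
    }
    return sum;
}
```

For the tasks `{add_one, 4}` and `{twice, 5}` the function returns 15 with or without the prefetch, which is the point: the hint may affect timing, never the sequencing.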

5. Why this might belong in the Standard

Because the idea is:

- semantic, not architectural
- portable across CPU, GPU, and FPGA-style systems
- potentially optimizable by compilers
- relevant to domains beyond HDL tools
- conceptually related to execution agents, coroutines, and sequencing in executors
- about exposing user intent (dynamic ordering), not hardware control

Many modern workloads — including simulation, actor frameworks, reactive graph engines, and fine-grained schedulers — could benefit from a portable way to express:

“Here is the next unit of evaluation; no need to return to a dispatcher.”

even if the implementation varies drastically per target platform.

6. Early-stage nature

This is still an early R0 exploratory draft.
I fully expect that the idea will require:

- reframing in abstract-machine terminology
- better examples
- clarification of how sequencing is expressed
- exploration of implementability on real compilers

I appreciate your question because it helps anchor the discussion in the right conceptual layer.

Thank you again for engaging — your perspective is extremely valuable.

On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh
Lakkampally via Std-Proposals wrote:
> The goal is to support workloads where the execution order of micro-tasks
> changes dynamically and unpredictably every cycle, such as *event-driven
> HDL/SystemVerilog simulation*.
> In such environments, conventional C++ mechanisms (threads, coroutines,
> futures, indirect calls, executors) incur significant pipeline redirection
> penalties. Fetch-only instructions aim to address this problem in a
> structured, language-visible way.
>
> I would greatly appreciate *feedback, criticism, and suggestions* from the
> community.
> I am also *open to collaboration.*

Can you explain why this should be in the Standard? Why are the prefetch
intrinsics available as compiler extensions for a lot of architectures not
enough? My first reaction is that this type of design is going to be very
architecture-specific by definition, so using architecture-specific extensions
should not be an impediment.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Data Center - Platform & Sys. Eng.
--
Std-Proposals mailing list
Std-Proposals@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals