Hi Marc,

Thank you for your comments. Since the mailing list strips attachments, you have not seen the core details of the idea, so let me restate the proposal from the beginning.

1. What the proposal is NOT

It is not:

a new CPU instruction
a hardware prefetching mechanism
a cache hint
a pipeline control mechanism
an optimization directive
a request for compiler backend changes
a form of parallelism or GPU-style execution

None of these describe the concept accurately.

The proposal operates purely at the C++ abstract-machine level.

2. What the proposal is: A new way to express dynamic sequencing of micro-tasks

Many event-driven systems maintain a queue of very small tasks ("micro-tasks") whose execution order changes frequently at runtime:

At one moment: T1 → T2 → T3
Later: T3 → T4 → T1 → T2

In C++ today, these systems must route control through a dispatcher:

dispatcher();
→ T1();
dispatcher();
→ T2();
dispatcher();
→ T3();

Even though the intended program order is simply:

T1 → T2 → T3

This repeated dispatcher → task → dispatcher pattern:

is semantically unnecessary
consumes execution bandwidth
prevents compiler optimization
introduces unpredictable control flow
creates overhead for extremely fine-grained tasks
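
To make the pattern concrete, here is a minimal C++ sketch of how such a system is typically written today. The names `Task`, `Trace`, `dispatcher`, and `run_all` are hypothetical and chosen only for illustration; the trace string exists just to make the control-flow path visible.

```cpp
#include <deque>
#include <string>

// Hypothetical micro-tasks: each appends its name to a trace so the
// control-flow path can be inspected afterwards.
using Trace = std::string;
using Task = void (*)(Trace&);

void T1(Trace& t) { t += "T1 "; }
void T2(Trace& t) { t += "T2 "; }
void T3(Trace& t) { t += "T3 "; }

// The dispatcher is re-entered before every task: it pops the next entry
// from the run queue and hands it back to the driver loop.
Task dispatcher(std::deque<Task>& queue, Trace& t) {
    t += "dispatcher ";
    if (queue.empty()) return nullptr;
    Task next = queue.front();
    queue.pop_front();
    return next;
}

// Driver loop: dispatcher -> task -> dispatcher -> task -> ...
Trace run_all(std::deque<Task> queue) {
    Trace t;
    while (Task next = dispatcher(queue, t)) {
        next(t);
    }
    return t;
}
```

Calling `run_all({T1, T2, T3})` produces the trace `"dispatcher T1 dispatcher T2 dispatcher T3 dispatcher "`: the dispatcher appears between every pair of tasks, plus once more to discover that the queue is empty.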

The proposal asks:

Can C++ express dynamic sequencing declaratively, without requiring the program to re-enter the dispatcher between every micro-task?

3. The key idea: “Fetch-only operations”

A fetch-only operation is a C++ semantic construct that:

✔ does NOT compute
✔ does NOT read or write memory
✔ does NOT have observable side effects
✔ does NOT correspond to a function call or branch
✔ EXISTS ONLY to describe “what executes next”

In other words, it is a pure sequencing directive, not an instruction or computation.

For example (placeholder syntax):

fad q[i] = next_address;
fcd q[i] = thread_context;
fed q[i] = exec_context;

These operations place sequencing metadata into a dedicated structure.

4. What metadata is being represented?

Each “micro-task” is associated with small context fields:

8-bit thread-context
8-bit execution-context
an instruction address (or function entry)

These fields allow the program to encode:

“After completing this task, the next task to evaluate is at address X,
but only if its context matches Y.”

This enables the expression of dynamic scheduling decisions without returning through the dispatcher.
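
As a rough illustration only, the fields above could be modelled in ordinary C++ as follows. The type and function names (`SeqRecord`, `TaskEntry`, `next_if_match`) are assumptions made for this sketch and are not part of the proposal text; a plain function pointer stands in for "an instruction address (or function entry)".

```cpp
#include <cstdint>

// Hypothetical layout of one sequencing record; the field names are
// illustrative and not taken from the proposal text.
using TaskEntry = void (*)();

struct SeqRecord {
    std::uint8_t thread_ctx;  // 8-bit thread-context
    std::uint8_t exec_ctx;    // 8-bit execution-context
    TaskEntry    entry;       // function entry of the next micro-task
};

// "The next task is at address X, but only if its context matches Y."
// Returns the entry to evaluate next, or nullptr when the context differs.
inline TaskEntry next_if_match(const SeqRecord& rec,
                               std::uint8_t thread_ctx,
                               std::uint8_t exec_ctx) {
    const bool ok = rec.thread_ctx == thread_ctx && rec.exec_ctx == exec_ctx;
    return ok ? rec.entry : nullptr;
}
```

A caller would compare the result against `nullptr` before transferring control, so a record whose context does not match is never evaluated.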

5. Fetch-Only Region: where this metadata lives

Just as C++ programs conceptually have:

a stack region (automatic storage)
a heap region (dynamic storage)

the proposal introduces a fetch-only region:

memory that stores sequencing metadata
strictly controlled by the implementation (for example, through the MMU, the memory management unit)
with context validation
not accessible for ordinary loads/stores
used only for sequencing, not computation

This region is not hardware-specific; it is an abstract-machine concept, much like thread-local storage or the synchronization rules of the memory model.

6. Why this belongs at the language level

This is not expressible today because C++ has no construct to describe sequencing separately from execution.

Existing mechanisms:

Threads / tasks / executors

→ Require execution-path transitions

Coroutines

→ Maintain suspended execution frames

Function calls

→ Require call/return semantics

Dispatch loops

→ Centralize sequencing and exhibit overhead

As-if rule

→ Cannot remove dispatcher calls; they are semantically required

Dynamic sequencing is fundamentally:

not parallelism,
not computation,
not scheduling,
not hardware control.

It is language-level intent:

“Evaluate micro-tasks in this dynamic order, without routing control back through a dispatcher.”

This cannot be expressed with intrinsics, LLVM passes, or target-specific code.

7. Why this is portable

Different compilers/runtimes/architectures may choose different implementations:

pure software interpreter for sequencing
optimizing JIT
coroutine resumption graph
architecture-specific fast paths (optional)
or normal dispatch loops as fallback

The semantics remain:

fetch-only operations provide sequencing metadata
the fetch-only region stores it
execution proceeds in the declared order

This is a valid addition to the C++ abstract machine, not a hardware feature.
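
The "pure software interpreter" fallback from the list above might look roughly like the sketch below. It is an assumption-laden model, not the proposed semantics: the fetch-only region is represented by an ordinary `std::vector`, and `Slot`, `kEnd`, and `evaluate` are hypothetical names. Note that this fallback still contains a loop; it only shows that the declared order can be stored as data and changed without touching the tasks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical software model of the fetch-only region: a table filled in
// before evaluation starts. All names here are illustrative.
using Log = std::string;
using MicroTask = void (*)(Log&);

constexpr std::size_t kEnd = static_cast<std::size_t>(-1);

struct Slot {
    MicroTask   task;  // what to evaluate
    std::size_t next;  // index of the successor slot, or kEnd
};

void T1(Log& l) { l += "T1 "; }
void T2(Log& l) { l += "T2 "; }
void T3(Log& l) { l += "T3 "; }
void T4(Log& l) { l += "T4 "; }

// Walks the declared order starting at `first`. The step limit guards
// against a table that accidentally forms a cycle.
Log evaluate(const std::vector<Slot>& region, std::size_t first) {
    Log log;
    std::size_t steps = 0;
    for (std::size_t i = first;
         i != kEnd && i < region.size() && steps < region.size();
         i = region[i].next, ++steps) {
        region[i].task(log);
    }
    return log;
}
```

With the table `{{T1, 1}, {T2, kEnd}, {T3, 3}, {T4, 0}}`, starting at slot 2 evaluates T3, T4, T1, T2 in that order, which mirrors the "Later: T3 → T4 → T1 → T2" ordering from section 2.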

Why context fields matter

Context metadata enables:

correct sequencing: ensuring that only valid successor tasks are chosen
safety checks: preventing unintended jumps to unrelated micro-tasks
structural integrity: maintaining a well-defined evaluation graph

Security aspect

Because the sequencing structure is stored in a dedicated fetch-only region—conceptually similar to the way the stack and heap represent distinct memory roles—the context fields also allow:

validation of allowed transitions,
prevention of unauthorized or accidental modification, and
protection against control-flow corruption.

In other words, the combination of:

context identifiers, and
a dedicated fetch-only region (analogous to stack/heap regions)

provides a framework in which implementations can enforce both correctness and security properties for dynamic sequencing.
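
One way an implementation could model "validation of allowed transitions" and "prevention of accidental modification" in plain software is sketched below. The class `TransitionPolicy` and its member names are hypothetical, and sealing a table is only one assumed strategy among many; nothing here is prescribed by the proposal.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical software model; names are illustrative only.
class TransitionPolicy {
public:
    // Registers an allowed context transition. Refused once sealed.
    bool allow(std::uint8_t from, std::uint8_t to) {
        if (sealed_) return false;
        allowed_.insert(std::make_pair(from, to));
        return true;
    }

    // After sealing, the set of allowed transitions can no longer change.
    void seal() { sealed_ = true; }

    // A transition is permitted only if it was registered before sealing.
    bool permits(std::uint8_t from, std::uint8_t to) const {
        return sealed_ && allowed_.count(std::make_pair(from, to)) != 0;
    }

private:
    std::set<std::pair<std::uint8_t, std::uint8_t>> allowed_;
    bool sealed_ = false;
};
```

In this model the program declares its transitions up front, calls `seal()`, and every later sequencing step is checked with `permits()`; any attempt to add a transition after sealing is rejected.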

This occurs entirely at the semantic level; no explicit hardware behavior is assumed.

8. Why the proposal exists

Highly dynamic micro-task workloads (event-driven simulation is just one example) cannot currently express their intent in C++ without:

repeated dispatcher calls
unnecessary control-flow redirections
significant overhead for fine-grained scheduling

This proposal explores whether C++ can support such workloads in a portable and declarative manner.

9. Still early-stage

This is an R0 exploratory draft. I am refining terminology, abstract-machine semantics, and examples with the help of feedback from this discussion.

Best Regards,
Kamalesh Lakkampally,

Founder & CEO

www.chipnadi.com

On Thu, Dec 4, 2025 at 11:49 AM Marc Edouard Gauthier via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

Kamalesh,

It’s not clear that you’re looking to change anything at all in the language. If you are, you haven’t said exactly what it is.

It seems more that you have an unusual highly parallel hardware architecture, and that what you’re looking for is a very different compiler implementation, not a different or modified language.

For example, in C++ or most any procedural language, you can write some sequence of independent steps:

    a = b + c;

    d = e * f;

    g = 3 + 5 / h;

The compiler (compiler backend typically) is totally free to reorder these, or dispatch these, in whatever way the underlying hardware architecture allows.

If there are dependencies, such as, say, `k = a - 1;`, the compiler ensures operations are ordered to satisfy the dependency, in whatever way the hardware architecture allows for.

So it seems already possible to express “micro-tasks”, whatever these might be, as simple independent C++ statements.

Assuming you have a novel computer hardware architecture, and you want compiler support for it, your time is very likely much better spent studying LLVM and how to port it to a new architecture, than trying to propose things to this group without understanding what you’re proposing.

You may also find different languages or language extensions (many supported by LLVM) that help targeting highly parallel hardware such as GPUs, that perhaps better fit your hardware architecture.

At least you might come out of that exercise with much better understanding of the relationship between your hardware and compilers.

(My 2 cents.)

Marc

From: Std-Proposals <std-proposals-bounces@lists.isocpp.org> On Behalf Of Kamalesh Lakkampally via Std-Proposals
Sent: Wednesday, December 3, 2025 21:46
To: std-proposals@lists.isocpp.org
Cc: Kamalesh Lakkampally <info@chipnadi.com>
Subject: Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

Hello Thiago,

Thank you for the thoughtful question — it touches on the central issue of whether this idea belongs in the C++ Standard or should remain an architecture-specific extension.
Let me clarify the motivation more precisely, because the original message did not convey the full context.

1. This proposal is not about prefetching or cache-control

The draft text unfortunately used terminology that sounded hardware-adjacent, but the intent is not to introduce anything analogous to:

prefetch intrinsics
non-temporal loads/stores
cache control hints
pipeline fetch controls

Those are ISA-level optimizations and naturally belong in compiler extensions or architecture-specific intrinsics.

The concept here is completely different and exists at the semantic level, not the CPU-microarchitecture level.

2. The actual concept: explicit sequencing of dynamically changing micro-tasks

Many event-driven systems (HDL simulators are just one example) share a common execution model:

Thousands of micro-tasks
The order of tasks is computed dynamically
The order changes every cycle or even more frequently
Tasks themselves are small and often side-effect-free
A dispatcher function repeatedly selects the next task to run

In C++ today, this typically results in a control-flow pattern like:

dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …

even when the intended evaluation sequence is conceptually:

T1 → T2 → T3 → T4 → …

The key issue is expressiveness: C++ currently has no mechanism to express “evaluate these things in this dynamic order, without re-entering a dispatcher between each step”.

Coroutines, tasks, executors, thread pools, and dispatch loops all still fundamentally operate through:

repeated function calls, or
repeated returns to a central controller, or
runtime-managed schedulers

which means that as programs scale down to extremely fine-grained tasks, sequencing overhead becomes dominant, even on architectures where prefetching is not a concern.

3. Why this is not architecture-specific

The misunderstanding arises because “fetch-only operation” sounds like a CPU fetch-stage mechanism.

The actual idea is:

A mechanism in the abstract machine that allows a program to express a sequence of evaluations that does not pass through a central dispatcher after each step.

This can be implemented portably in many ways:

Using a software-run interpreter that consumes the sequencing structure
Using an implementation-specific optimization when available
Mapping to architecture-specific controls on processors that support such mechanisms
Or completely lowering into normal calls on simpler hardware

This is similar in spirit to:

coroutines: abstract machine semantics, many possible implementations
atomics: abstract semantics, which each architecture maps to whatever operations it has
SIMD types: portable semantics, mapped to different instructions per architecture
executors: abstract relationships, many possible backends

So the goal is not to standardize “fetch-stage hardware instructions”,
but to explore whether C++ can expose a portable semantic form of sequencing that compilers may optimize very differently depending on platform capabilities.

4. Why prefetch intrinsics are insufficient

Prefetch intrinsics provide:

data locality hints
non-temporal load/store hints
per-architecture micro-optimizations

They do not provide:

a semantic representation of evaluation order
a way to represent a dynamically computed schedule
a portable abstraction
a way to eliminate dispatcher re-entry in the program model

Prefetching does not remove the repeated calls/returns between tasks.
This proposal focuses on expressing intent, not on cache behavior.

5. Why this might belong in the Standard

Because the idea is:

semantic, not architectural
portable across CPU, GPU, and FPGA-style systems
potentially optimizable by compilers
relevant to domains beyond HDL tools
conceptually related to execution agents, coroutines, and sequencing in executors
about exposing user intent (dynamic ordering), not hardware control

Many modern workloads — including simulation, actor frameworks, reactive graph engines, and fine-grained schedulers — could benefit from a portable way to express:

“Here is the next unit of evaluation; no need to return to a dispatcher.”

even if the implementation varies drastically per target platform.

6. Early-stage nature

This is still an early R0 exploratory draft.
I fully expect that the idea will require:

reframing in abstract-machine terminology
better examples
clarification of how sequencing is expressed
exploration of implementability on real compilers

I appreciate your question because it helps anchor the discussion in the right conceptual layer.

Thank you again for engaging — your perspective is extremely valuable.

On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh
Lakkampally via Std-Proposals wrote:
> The goal is to support workloads where the execution order of micro-tasks
> changes dynamically and unpredictably every cycle, such as *event-driven
> HDL/SystemVerilog simulation*.
> In such environments, conventional C++ mechanisms (threads, coroutines,
> futures, indirect calls, executors) incur significant pipeline redirection
> penalties. Fetch-only instructions aim to address this problem in a
> structured, language-visible way.
>
> I would greatly appreciate *feedback, criticism, and suggestions* from the
> community.
> I am also *open to collaboration.*

Can you explain why this should be in the Standard? Why are the prefetch
intrinsics available as compiler extensions for a lot of architectures not
enough? My first reaction is that this type of design is going to be very
architecture-specific by definition, so using architecture-specific extensions
should not be an impediment.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Data Center - Platform & Sys. Eng.
--
Std-Proposals mailing list
Std-Proposals@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
