Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

From: Kamalesh Lakkampally <info_at_[hidden]>
Date: Thu, 4 Dec 2025 11:15:37 +0530

Hello Thiago,

Thank you for the thoughtful question — it touches on the central issue of
whether this idea belongs in the C++ Standard or should remain an
architecture-specific extension.
Let me clarify the motivation more precisely, because the original message
did not convey the full context.

*1. This proposal is not about prefetching or cache-control*

The draft text unfortunately used terminology that sounded
hardware-adjacent, but the intent is *not* to introduce anything analogous
to:

   - prefetch intrinsics
   - non-temporal loads/stores
   - cache control hints
   - pipeline fetch controls

Those are ISA-level optimizations and naturally belong in compiler
extensions or architecture-specific intrinsics.

The concept here is completely different and exists at the *semantic level*,
not the CPU-microarchitecture level.


*2. The actual concept: explicit sequencing of dynamically changing
micro-tasks*

Many event-driven systems (HDL simulators are just one example) share a
common execution model:

   - Thousands of *micro-tasks*
   - The order of tasks is computed dynamically
   - The order changes every cycle or even more frequently
   - Tasks themselves are small and often side-effect-free
   - A dispatcher function repeatedly selects the next task to run

In C++ today, this typically results in a control-flow pattern like:

dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …

even when the *intended evaluation sequence* is conceptually:

T1 → T2 → T3 → T4 → …

The key issue is *expressiveness*: C++ currently has no mechanism to
express “evaluate these things in this dynamic order, without re-entering a
dispatcher between each step”.
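
For concreteness, here is a minimal sketch of that status-quo pattern (the
names and types are illustrative only, not part of the proposal): a dispatcher
walks a dynamically computed schedule and invokes each micro-task through an
indirect call, so control bounces back into the dispatcher after every task.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Illustrative micro-task: a small, often side-effect-free work item.
    using MicroTask = std::function<void()>;

    // One simulation cycle: 'schedule' is recomputed before each call.
    void run_cycle(std::vector<MicroTask>& tasks,
                   const std::vector<std::size_t>& schedule)
    {
        for (std::size_t index : schedule) {   // dispatcher picks the next task...
            tasks[index]();                    // ...indirect call into the task...
        }                                      // ...and control returns here each time
    }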

Coroutines, tasks, executors, thread pools, and dispatch loops all still
fundamentally operate through:

   - repeated function calls, or
   - repeated returns to a central controller, or
   - runtime-managed schedulers

which means that as programs scale down to extremely fine-grained
tasks, *sequencing
overhead becomes dominant*, even on architectures where prefetching is not
a concern.
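
Coroutines illustrate this well: even with C++20 coroutines, each micro-task
is resumed from a central loop and suspends back into that loop, so the
per-task sequencing cost remains. A minimal sketch, assuming a fire-and-forget
coroutine task (all names are illustrative; handle destruction is omitted for
brevity):

    #include <coroutine>
    #include <exception>
    #include <vector>

    // Illustrative fire-and-forget micro-task coroutine (no result, no cleanup).
    struct MicroTask {
        struct promise_type {
            MicroTask get_return_object() {
                return {std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            void return_void() noexcept {}
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> handle;
    };

    void run_cycle(std::vector<MicroTask>& schedule)
    {
        for (auto& task : schedule)   // the loop is still a central controller:
            task.handle.resume();     // resume the task, the task suspends/finishes,
    }                                 // and control returns here between every pair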


*3. Why this is not architecture-specific*

The misunderstanding arises because the phrase “fetch-only operation” sounds
like a CPU fetch-stage mechanism.

The actual idea is:

A mechanism in the abstract machine that allows a program to express a
*sequence
of evaluations* that does not pass through a central dispatcher after each
step.

This can be implemented portably in many ways:

   - Using a software-run interpreter that consumes the sequencing structure
   - Using an implementation-specific optimization when available
   - Mapping to architecture-specific controls on processors that support such mechanisms
   - Or completely lowering into normal calls on simpler hardware

This is similar in spirit to:

   - *coroutines*: abstract machine semantics, many possible implementations
   - *atomics*: abstract semantics, architecture maps them to whatever operations it has
   - *SIMD types*: portable semantics, mapped to different instructions per architecture
   - *executors*: abstract relationships, many possible backends

So the goal is not to standardize “fetch-stage hardware instructions”,
but to explore whether C++ can expose a *portable semantic form of
sequencing that compilers may optimize very differently depending on
platform capabilities*.
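
To make the shape of the idea more concrete, here is a purely hypothetical
strawman of what a library-level surface could look like; none of these names
or signatures come from the draft, and the run() shown below is only the
trivial fallback lowering (ordinary calls in order). The point is that the
program states the whole sequence once, and an implementation is then free to
consume it however the target allows.

    #include <functional>
    #include <utility>
    #include <vector>

    // Hypothetical strawman: an "evaluation sequence" that the program builds
    // dynamically and then hands to the implementation as a single unit.
    class eval_sequence {
    public:
        void append(std::function<void()> step) { steps_.push_back(std::move(step)); }

        // Trivial fallback lowering: plain calls in order.  A smarter
        // implementation could consume the whole sequence without a
        // program-visible dispatcher between the steps.
        void run() {
            for (auto& step : steps_) step();
            steps_.clear();
        }

    private:
        std::vector<std::function<void()>> steps_;
    };

    // Usage: the schedule is recomputed, expressed once, then evaluated.
    void simulate_cycle(eval_sequence& seq) {
        seq.append([] { /* T1 */ });
        seq.append([] { /* T2 */ });
        seq.append([] { /* T3 */ });
        seq.run();   // intent: "evaluate these, in this order, as one unit"
    }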


*4. Why prefetch intrinsics are insufficient*

Prefetch intrinsics provide:

   - data locality hints
   - non-temporal load/store hints
   - per-architecture micro-optimizations

They do *not* provide:

   - a semantic representation of evaluation order
   - a way to represent a dynamically computed schedule
   - a portable abstraction
   - a way to eliminate dispatcher re-entry in the program model

Prefetching does not remove the repeated calls/returns between tasks.
This proposal focuses on *expressing intent*, not on cache behavior.
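
As a small illustration (using __builtin_prefetch, a GCC/Clang extension,
purely as a stand-in for prefetch intrinsics in general): hinting the next
task's data may hide memory latency, but the indirect call and the return to
the dispatcher loop are exactly the same as before.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using MicroTask = std::function<void()>;

    void run_cycle(std::vector<MicroTask>& tasks,
                   const std::vector<std::size_t>& schedule)
    {
        for (std::size_t i = 0; i < schedule.size(); ++i) {
            if (i + 1 < schedule.size())
                __builtin_prefetch(&tasks[schedule[i + 1]]);  // data-locality hint only
            tasks[schedule[i]]();   // the call/return structure is unchanged:
        }                           // control still re-enters this loop per task
    }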

*5. Why this might belong in the Standard*

Because the idea is:

   - semantic, not architectural
   - portable across CPU, GPU, and FPGA-style systems
   - potentially optimizable by compilers
   - relevant to domains beyond HDL tools
   - conceptually related to execution agents, coroutines, and sequencing in executors
   - about exposing user intent (dynamic ordering), not hardware control

Many modern workloads — including simulation, actor frameworks, reactive
graph engines, and fine-grained schedulers — could benefit from a portable
way to express:

“Here is the next unit of evaluation; no need to return to a dispatcher.”

even if the implementation varies drastically per target platform.


*6. Early-stage nature*

This is still an early R0 exploratory draft.
I fully expect that the idea will require:

   - reframing in abstract-machine terminology
   - better examples
   - clarification of how sequencing is expressed
   - exploration of implementability on real compilers

I appreciate your question because it helps anchor the discussion in the
right conceptual layer.

Thank you again for engaging — your perspective is extremely valuable.





On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh
> Lakkampally via Std-Proposals wrote:
> > The goal is to support workloads where the execution order of micro-tasks
> > changes dynamically and unpredictably every cycle, such as *event-driven
> > HDL/SystemVerilog simulation*.
> > In such environments, conventional C++ mechanisms (threads, coroutines,
> > futures, indirect calls, executors) incur significant pipeline redirection
> > penalties. Fetch-only instructions aim to address this problem in a
> > structured, language-visible way.
> >
> > I would greatly appreciate *feedback, criticism, and suggestions* from the
> > community.
> > I am also *open to collaboration.*
>
> Can you explain why this should be in the Standard? Why are the prefetch
> intrinsics available as compiler extensions for a lot of architectures not
> enough? My first reaction is that this type of design is going to be very
> architecture-specific by definition, so using architecture-specific extensions
> should not be an impediment.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Data Center - Platform & Sys. Eng.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>

Received on 2025-12-04 05:46:16