
Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

From: Kamalesh Lakkampally <founder_at_[hidden]>
Date: Thu, 4 Dec 2025 13:45:16 +0530

Hi Robin,

Thank you, that’s a very helpful way to frame the questions.

1. What you understood correctly

Yes, at a high level you understood the intention:

   - I want to avoid repeatedly “returning to the dispatcher” (or central
     caller) between micro-tasks.
   - I want the program to be able to describe a sequence of tasks that can
     change dynamically at runtime, and then have execution proceed along that
     sequence directly.

So instead of:

  dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → ...

the *intent* is to express:

  T1 → T2 → T3 → ...

where the exact order is computed dynamically.

2. Why this is not just goto

A goto in C++:

   - is *local to a function*
   - jumps to a label in the *same* function
   - still lives entirely within the usual call/return and stack discipline
   - does not encode any higher-level scheduling or context

What I am exploring is *not* a new low-level transfer like goto, but a way
to:

   - build a *sequence of tasks* (potentially different functions or
     micro-tasks),
   - optionally annotated with small context identifiers,
   - that a runtime can then walk through as the next evaluation steps.

So it’s closer to:

   - “a declarative schedule of what to run next”

than to:

   - “a raw jump from one instruction to another.”

The semantics I have in mind *do not replace* C++’s existing call/return or
stack behavior; they add a separate, higher-level description of evaluation
order.

3. Calling conventions and ABI

You are absolutely right that anything that *directly* bypasses call/return
and ABI would be very problematic.

To be clear: I am *not* proposing that C++ programmers manually circumvent
calling conventions or stack protocols. Any eventual implementation in a
conforming C++ implementation would still:

   - obey the platform ABI,
   - use normal calls/returns or coroutine resumption under the hood, or
   - use some runtime mechanism that remains conforming.

The proposal is about *expressing the sequencing intent in the
language/abstract machine*, not about specifying a particular low-level
jump mechanism. How that sequencing is lowered (calls, trampolines, state
machines, etc.) would be implementation-defined.

4. Performance evidence (honest status)

Right now, this is at the *research / exploratory* stage:

   - The idea is motivated by internal experiments in event-driven simulation
     workloads, where dispatcher-style control flow becomes a bottleneck.
   - I do *not* yet have a standardized, portable C++ compiler extension that
     we can benchmark across architectures.
   - I agree that for this to move beyond an “interesting idea”, we will need
     at least one concrete prototype and some performance data.

So in that sense, you are absolutely right to ask for measurements: they
will be essential before any serious standardization attempt. At this point
I am trying to:

   - check whether the conceptual direction makes sense to the committee, and
   - learn how to better express it in abstract-machine terms,

   before investing in a more complete implementation and measurement
   campaign.

Thanks again for the questions — they help me refine both the technical and
explanatory sides of the proposal.




*Best Regards,*
*Kamalesh Lakkampally,*
*Founder & CEO*
www.chipnadi.com



On Thu, Dec 4, 2025 at 12:44 PM Robin Savonen Söderholm via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> So if I understand correctly, you want to avoid "returning to the callee"
> and rather just jump to the next "task" (as something similar but not quite
> a function call) in hope that it would speed up the process? And you want
> to dynamically change whatever is next on the queue?
> If so, how does this differ from e.g. "goto"? And can we get in trouble
> because this sounds as we will need to circumvent/handle manually things
> from the calling conventions? I wonder if we really can get so much
> performance from this proposal. Do you have a (platform-specific) example
> project that proves that your particular compiler extension indeed can give
> measurable performance improvements?
>
> // Robin
>
> On Thu, Dec 4, 2025, 07:43 Kamalesh Lakkampally via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
>> Hi Marc,
>>
>> Thank you for your comments. Since the mailing list strips attachments,
>> you have not seen the core details of the idea, so let me restate the
>> proposal from the beginning.
>>
>> 1. *What the proposal is NOT*
>>
>> It is *not*:
>>
>> - a new CPU instruction
>> - a hardware prefetching mechanism
>> - a cache hint
>> - a pipeline control mechanism
>> - an optimization directive
>> - a request for compiler backend changes
>> - a form of parallelism or GPU-style execution
>>
>> None of these describe the concept accurately.
>>
>> The proposal operates *purely at the C++ abstract-machine level*.
>>
>>
>> 2. *What the proposal is: A new way to express dynamic sequencing of
>> micro-tasks*
>>
>> Many event-driven systems maintain a queue of very small tasks
>> ("micro-tasks") whose execution order changes *frequently* at runtime:
>>
>> At one moment: T1 → T2 → T3
>> Later: T3 → T4 → T1 → T2
>>
>> In C++ today, these systems must route control through a dispatcher:
>> dispatcher();
>> → T1();
>> dispatcher();
>> → T2();
>> dispatcher();
>> → T3();
>>
>> Even though the *intended* program order is simply:
>> T1 → T2 → T3
>>
>> This repeated dispatcher → task → dispatcher pattern:
>>
>> - is semantically unnecessary
>> - consumes execution bandwidth
>> - prevents compiler optimization
>> - introduces unpredictable control flow
>> - creates overhead for extremely fine-grained tasks
>>
>> The proposal asks:
>>
>> *Can C++ express dynamic sequencing declaratively, without requiring the
>> program to re-enter the dispatcher between every micro-task?*
>>
>>
>> 3. *The key idea: “Fetch-only operations”*
>>
>> A *fetch-only operation* is a C++ semantic construct that:
>> ✔ does NOT compute
>> ✔ does NOT read or write memory
>> ✔ does NOT have observable side effects
>> ✔ does NOT correspond to a function call or branch
>> ✔ EXISTS ONLY to describe “what executes next”
>>
>> In other words, it is a *pure sequencing directive*, not an instruction
>> or computation.
>>
>> For example (placeholder syntax):
>>
>> fad q[i] = next_address;
>> fcd q[i] = thread_context;
>> fed q[i] = exec_context;
>>
>> These operations place *sequencing metadata* into a dedicated structure.
>>
>> 4. *What metadata is being represented?*
>>
>> Each “micro-task” is associated with small context fields:
>>
>> - *8-bit thread-context*
>> - *8-bit execution-context*
>> - *an instruction address (or function entry)*
>>
>> These fields allow the program to encode:
>>
>> “After completing this task, the *next* task to evaluate is at address X,
>> but only if its context matches Y.”
>>
>> This enables the expression of dynamic scheduling decisions *without
>> returning through the dispatcher*.
>>
>>
>> 5. *Fetch-Only Region: where this metadata lives*
>>
>> Just as C++ programs conceptually have:
>>
>> - a *stack region* (automatic storage)
>> - a *heap region* (dynamic storage)
>>
>> the proposal introduces a *fetch-only region*:
>>
>> - memory that stores sequencing metadata
>> - strictly controlled by the implementation (e.g. enforced via the MMU,
>> the memory management unit)
>> - with context validation
>> - not accessible for ordinary loads/stores
>> - used only for sequencing, not computation
>>
>> This region is *not hardware-specific*; it is an abstract-machine
>> concept, much like the thread-local storage model or atomic synchronization
>> regions.
>>
>>
>> 6. *Why this belongs at the language level*
>>
>> This is not expressible today because C++ has *no construct to describe
>> sequencing separately from execution*.
>>
>> Existing mechanisms:
>>
>> - Threads / tasks / executors → require execution-path transitions
>> - Coroutines → maintain suspended execution frames
>> - Function calls → require call/return semantics
>> - Dispatch loops → centralize sequencing and exhibit overhead
>> - As-if rule → cannot remove dispatcher calls; they are semantically
>> required
>>
>> Dynamic sequencing is fundamentally:
>>
>> - *not parallelism*,
>> - *not computation*,
>> - *not scheduling*,
>> - *not hardware control*.
>>
>> It is *language-level intent*:
>>
>> “Evaluate micro-tasks in this dynamic order, without routing control back
>> through a dispatcher.”
>>
>> This cannot be expressed with intrinsics, LLVM passes, or target-specific
>> code.
>>
>>
>> 7. *Why this is portable*
>>
>> Different compilers/runtimes/architectures may choose different
>> implementations:
>>
>> - a pure software interpreter for sequencing
>> - an optimizing JIT
>> - a coroutine resumption graph
>> - architecture-specific fast paths (optional)
>> - or normal dispatch loops as a fallback
>>
>> The semantics remain:
>>
>> *Fetch-only operations provide sequencing metadata; the fetch-only region
>> stores it; execution proceeds in the declared order.*
>>
>> This is a valid addition to the C++ abstract machine, not a hardware
>> feature.
>>
>> *Why context fields matter*
>>
>> Context metadata enables:
>>
>> - *correct sequencing:* ensuring that only valid successor tasks are
>> chosen
>> - *safety checks:* preventing unintended jumps to unrelated micro-tasks
>> - *structural integrity:* maintaining a well-defined evaluation graph
>>
>> *Security aspect*
>>
>> Because the sequencing structure is stored in a dedicated *fetch-only
>> region*—conceptually similar to the way the stack and heap represent
>> distinct memory roles—the context fields also allow:
>>
>> - *validation of allowed transitions*,
>> - *prevention of unauthorized or accidental modification*, and
>> - *protection against control-flow corruption.*
>>
>> In other words, the combination of:
>>
>> - *context identifiers*, and
>> - *a dedicated fetch-only region (analogous to stack/heap regions)*
>>
>> provides a framework in which implementations can enforce both
>> correctness and security properties for dynamic sequencing.
>>
>> This occurs entirely at the semantic level; no explicit hardware behavior
>> is assumed.
>>
>> 8. *Why the proposal exists*
>>
>> Highly dynamic micro-task workloads (event-driven simulation is just one
>> example) cannot currently express their intent in C++ without:
>>
>> - repeated dispatcher calls
>> - unnecessary control-flow redirections
>> - significant overhead for fine-grained scheduling
>>
>> This proposal explores whether C++ can support such workloads in a
>> portable and declarative manner.
>>
>>
>> 9. *Still early-stage*
>>
>> This is an R0 exploratory draft. I am refining terminology,
>> abstract-machine semantics, and examples with the help of feedback from
>> this discussion.
>>
>>
>>
>>
>> *Best Regards,*
>> *Kamalesh Lakkampally,*
>> *Founder & CEO*
>> www.chipnadi.com
>>
>>
>>
>> On Thu, Dec 4, 2025 at 11:49 AM Marc Edouard Gauthier via Std-Proposals <
>> std-proposals_at_[hidden]> wrote:
>>
>>> Kamalesh,
>>>
>>>
>>>
>>> It’s not clear that you’re looking to change anything at all in the
>>> language. If you are, you haven’t said exactly what it is.
>>>
>>>
>>>
>>> It seems more that you have an unusual highly parallel hardware
>>> architecture, and that what you’re looking for is a very different compiler
>>> implementation, not a different or modified language.
>>>
>>> For example, in C++ or most any procedural language, you can write some
>>> sequence of independent steps:
>>>
>>>
>>>
>>> a = b + c;
>>>
>>> d = e * f;
>>>
>>> g = 3 + 5 / h;
>>>
>>>
>>>
>>> The compiler (compiler backend typically) is totally free to reorder
>>> these, or dispatch these, in whatever way the underlying hardware
>>> architecture allows.
>>>
>>> If there are dependencies, such as, say, `k = a - 1;`, the compiler
>>> ensures operations are ordered to satisfy the dependency, in whatever way
>>> the hardware architecture allows for.
>>>
>>>
>>>
>>> So it seems already possible to express “micro-tasks”, whatever these
>>> might be, as simple independent C++ statements.
>>>
>>>
>>>
>>> Assuming you have a novel computer hardware architecture, and you want
>>> compiler support for it, your time is very likely much better spent
>>> studying LLVM and how to port it to a new architecture, than trying to
>>> propose things to this group without understanding what you’re proposing.
>>>
>>> You may also find different languages or language extensions (many
>>> supported by LLVM) that help targeting highly parallel hardware such as
>>> GPUs, that perhaps better fit your hardware architecture.
>>>
>>>
>>>
>>> At least you might come out of that exercise with much better
>>> understanding of the relationship between your hardware and compilers.
>>>
>>>
>>>
>>> (My 2 cents.)
>>>
>>>
>>>
>>> Marc
>>>
>>>
>>>
>>>
>>>
>>> *From:* Std-Proposals <std-proposals-bounces_at_[hidden]> *On
>>> Behalf Of *Kamalesh Lakkampally via Std-Proposals
>>> *Sent:* Wednesday, December 3, 2025 21:46
>>> *To:* std-proposals_at_[hidden]
>>> *Cc:* Kamalesh Lakkampally <info_at_[hidden]>
>>> *Subject:* Re: [std-proposals] Core-Language Extension for Fetch-Only
>>> Instruction Semantics
>>>
>>>
>>>
>>> Hello Thiago,
>>>
>>> Thank you for the thoughtful question — it touches on the central issue
>>> of whether this idea belongs in the C++ Standard or should remain an
>>> architecture-specific extension.
>>> Let me clarify the motivation more precisely, because the original
>>> message did not convey the full context.
>>>
>>> *1. This proposal is **not** about prefetching or cache-control*
>>>
>>> The draft text unfortunately used terminology that sounded
>>> hardware-adjacent, but the intent is *not* to introduce anything
>>> analogous to:
>>>
>>> - prefetch intrinsics
>>> - non-temporal loads/stores
>>> - cache control hints
>>> - pipeline fetch controls
>>>
>>> Those are ISA-level optimizations and naturally belong in compiler
>>> extensions or architecture-specific intrinsics.
>>>
>>> The concept here is completely different and exists at the *semantic
>>> level*, not the CPU-microarchitecture level.
>>>
>>>
>>> *2. The actual concept: explicit sequencing of dynamically changing
>>> micro-tasks*
>>>
>>> Many event-driven systems (HDL simulators are just one example) share a
>>> common execution model:
>>>
>>> - Thousands of *micro-tasks*
>>> - The order of tasks is computed dynamically
>>> - The order changes every cycle or even more frequently
>>> - Tasks themselves are small and often side-effect-free
>>> - A dispatcher function repeatedly selects the next task to run
>>>
>>> In C++ today, this typically results in a control-flow pattern like:
>>>
>>> dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …
>>>
>>> even when the *intended evaluation sequence* is conceptually:
>>>
>>> T1 → T2 → T3 → T4 → …
>>>
>>> The key issue is *expressiveness*: C++ currently has no mechanism to
>>> express “evaluate these things in this dynamic order, without re-entering a
>>> dispatcher between each step”.
>>>
>>> Coroutines, tasks, executors, thread pools, and dispatch loops all still
>>> fundamentally operate through:
>>>
>>> - repeated function calls, or
>>> - repeated returns to a central controller, or
>>> - runtime-managed schedulers
>>>
>>> which means that as programs scale down to extremely fine-grained tasks, *
>>> sequencing overhead becomes dominant*, even on architectures where
>>> prefetching is not a concern.
>>>
>>>
>>> *3. Why this is not architecture-specific*
>>>
>>> The misunderstanding arises when “fetch-only operation” sounds like a
>>> CPU fetch-stage mechanism.
>>>
>>> The actual idea is:
>>>
>>> A mechanism in the abstract machine that allows a program to express a *sequence
>>> of evaluations* that does not pass through a central dispatcher after
>>> each step.
>>>
>>> This can be implemented portably in many ways:
>>>
>>> - Using a software-run interpreter that consumes the sequencing
>>> structure
>>> - Using an implementation-specific optimization when available
>>> - Mapping to architecture-specific controls on processors that
>>> support such mechanisms
>>> - Or completely lowering into normal calls on simpler hardware
>>>
>>> This is similar in spirit to:
>>>
>>> - *coroutines*: abstract machine semantics, many possible
>>> implementations
>>> - *atomics*: abstract semantics, architecture maps them to whatever
>>> operations it has
>>> - *SIMD types*: portable semantics, mapped to different instructions
>>> per architecture
>>> - *executors*: abstract relationships, many possible backends
>>>
>>> So the goal is not to standardize “fetch-stage hardware instructions”,
>>> but to explore whether C++ can expose a *portable semantic form of
>>> sequencing that compilers may optimize very differently depending on
>>> platform capabilities*.
>>>
>>>
>>> *4. Why prefetch intrinsics are insufficient*
>>>
>>> Prefetch intrinsics provide:
>>>
>>> - data locality hints
>>> - non-temporal load/store hints
>>> - per-architecture micro-optimizations
>>>
>>> They do *not* provide:
>>>
>>> - a semantic representation of evaluation order
>>> - a way to represent a dynamically computed schedule
>>> - a portable abstraction
>>> - a way to eliminate dispatcher re-entry in the program model
>>>
>>> Prefetching does not remove the repeated calls/returns between tasks.
>>> This proposal focuses on *expressing intent*, not on cache behavior.
>>>
>>> *5. Why this might belong in the Standard*
>>>
>>> Because the idea is:
>>>
>>> - semantic, not architectural
>>> - portable across CPU, GPU, and FPGA-style systems
>>> - potentially optimizable by compilers
>>> - relevant to domains beyond HDL tools
>>> - conceptually related to execution agents, coroutines, and
>>> sequencing in executors
>>> - about exposing user intent (dynamic ordering), not hardware control
>>>
>>> Many modern workloads — including simulation, actor frameworks, reactive
>>> graph engines, and fine-grained schedulers — could benefit from a portable
>>> way to express:
>>>
>>> “Here is the next unit of evaluation; no need to return to a dispatcher.”
>>>
>>> even if the implementation varies drastically per target platform.
>>>
>>>
>>> *6. Early-stage nature*
>>>
>>> This is still an early R0 exploratory draft.
>>> I fully expect that the idea will require:
>>>
>>> - reframing in abstract-machine terminology
>>> - better examples
>>> - clarification of how sequencing is expressed
>>> - exploration of implementability on real compilers
>>>
>>> I appreciate your question because it helps anchor the discussion in the
>>> right conceptual layer.
>>>
>>> Thank you again for engaging — your perspective is extremely valuable.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <
>>> std-proposals_at_[hidden]> wrote:
>>>
>>> On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh
>>> Lakkampally via Std-Proposals wrote:
>>> > The goal is to support workloads where the execution order of
>>> micro-tasks
>>> > changes dynamically and unpredictably every cycle, such as
>>> *event-driven
>>> > HDL/SystemVerilog simulation*.
>>> > In such environments, conventional C++ mechanisms (threads, coroutines,
>>> > futures, indirect calls, executors) incur significant pipeline
>>> redirection
>>> > penalties. Fetch-only instructions aim to address this problem in a
>>> > structured, language-visible way.
>>> >
>>> > I would greatly appreciate *feedback, criticism, and suggestions* from
>>> the
>>> > community.
>>> > I am also *open to collaboration.*
>>>
>>> Can you explain why this should be in the Standard? Why are the prefetch
>>> intrinsics available as compiler extensions for a lot of architectures
>>> not
>>> enough? My first reaction is that this type of design is going to be
>>> very
>>> architecture-specific by definition, so using architecture-specific
>>> extensions
>>> should not be an impediment.
>>>
>>> --
>>> Thiago Macieira - thiago (AT) macieira.info
>>> - thiago (AT) kde.org
>>> Principal Engineer - Intel Data Center - Platform & Sys. Eng.
>>> --
>>> Std-Proposals mailing list
>>> Std-Proposals_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>>
>>> --
>>> Std-Proposals mailing list
>>> Std-Proposals_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>>
>> --
>> Std-Proposals mailing list
>> Std-Proposals_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>

Received on 2025-12-04 08:15:56