std-proposals

Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 4 Dec 2025 14:08:57 +0100
First of all, calling this a "fetch-only" thing is a huge misnomer
for me. There are lots of things in C++ that influence control flow
only, but don't (e.g.) modify state or similar. Such as a while loop.
Call it "novel control flow" if you wish.

You've addressed the question why "goto" doesn't work for you
(need cross-function), so maybe "longjmp" is a better model for
you? We also have stateful coroutines in the making (look for
"fibers"), which might help you.

Talking about "8-bit" in a proposal purely about the abstract machine
is entirely unhelpful; we don't specify the bit-width of an "int",
and we certainly should not specify the bit-width of something that
can hold a thread context handle. (I have more than 256 threads,
for sure.)

You're talking about "declarative", but I don't understand how
much dynamic influence you want to have.

Can a task T1 enter another task T5 into the queue, for later
execution? If not, how do those tasks come into being in
the first place?

Jens



On 12/4/25 09:15, Kamalesh Lakkampally via Std-Proposals wrote:
> Hi Robin,
>
> Thank you, that’s a very helpful way to frame the questions.
>
>
> 1. What you understood correctly
>
> Yes, at a high level you understood the intention:
>
> * I want to avoid repeatedly “returning to the dispatcher” (or central caller) between micro-tasks.
>
> * I want the program to be able to describe a sequence of tasks that can change dynamically at runtime, and then have execution proceed along that sequence directly.
>
> So instead of:
>
> dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → ...
>
> the /intent/ is to express:
> T1 → T2 → T3 → ...
>
>
> where the exact order is computed dynamically.
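> For illustration, today's closest library-level approximation of this hand-off is a continuation-style chain in which each micro-task returns its successor, so the walk T1 → T2 → T3 becomes a single loop with no per-step dispatcher logic. All names below are invented; this is a sketch of the status quo, not the proposed mechanism.

```cpp
#include <vector>

// Each micro-task does its work and names the next task to run.
// (A function pointer wrapped in a struct so the type can refer to itself.)
struct Task {
    Task (*fn)(std::vector<int>& log);
};

Task t3(std::vector<int>& log) { log.push_back(3); return {nullptr}; }
Task t2(std::vector<int>& log) { log.push_back(2); return {&t3}; }
Task t1(std::vector<int>& log) { log.push_back(1); return {&t2}; }

// The "dispatcher" degenerates to one loop: control flows T1 -> T2 -> T3
// directly, with the order computed dynamically by the tasks themselves.
std::vector<int> run_chain(Task first) {
    std::vector<int> log;
    for (Task cur = first; cur.fn != nullptr; )
        cur = cur.fn(log);
    return log;
}
```

> Even this sketch still pays an indirect call per step, which is part of what the proposal wants to expose to the implementation for optimization.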
>
>
> 2. Why this is not just |goto|
>
> A |goto| in C++:
>
> * is *local to a function*
>
> * jumps to a label in the /same/ function
>
> * still lives entirely within the usual call/return and stack discipline
>
> * does not encode any higher-level scheduling or context
>
> What I am exploring is /not/ a new low-level transfer like |goto|, but a way to:
>
> * build a *sequence of tasks* (potentially different functions or micro-tasks),
>
> * optionally annotated with small context identifiers,
>
> * that a runtime can then walk through as the next evaluation steps.
>
> So it’s closer to:
>
> * “a declarative schedule of what to run next”
>
> than to:
>
> * “a raw jump from one instruction to another.”
>
> The semantics I have in mind *do not replace* C++’s existing call/return or stack behavior; they add a separate, higher-level description of evaluation order.
>
>
> 3. Calling conventions and ABI
>
> You are absolutely right that anything that /directly/ bypasses call/return and ABI would be very problematic.
>
> To be clear: I am *not* proposing that C++ programmers manually circumvent calling conventions or stack protocols. Any eventual implementation in a conforming C++ implementation would still:
>
> * obey the platform ABI,
>
> * use normal calls/returns or coroutine resumption under the hood, or
>
> * use some runtime mechanism that remains conforming.
>
> The proposal is about *expressing the sequencing intent in the language/abstract machine*, not about specifying a particular low-level jump mechanism. How that sequencing is lowered (calls, trampolines, state machines, etc.) would be implementation-defined.
>
>
> 4. Performance evidence (honest status)
>
> Right now, this is at the /research / exploratory/ stage:
>
> * The idea is motivated by internal experiments in event-driven simulation workloads, where dispatcher-style control flow becomes a bottleneck.
>
> * I do *not* yet have a standardized, portable C++ compiler extension that we can benchmark across architectures.
>
> * I agree that for this to move beyond an “interesting idea”, we will need at least one concrete prototype and some performance data.
>
> So in that sense, you are absolutely right to ask for measurements: they will be essential before any serious standardization attempt. At this point I am trying to:
>
> * check whether the conceptual direction makes sense to the committee, and
>
> * learn how to better express it in abstract-machine terms,
>
> before investing in a more complete implementation and measurement campaign.
>
> Thanks again for the questions — they help me refine both the technical and explanatory sides of the proposal.
>
>
>
> *Best Regards,
> Kamalesh Lakkampally,*
> *Founder & CEO*
> www.chipnadi.com <http://www.chipnadi.com>
>
>
>
> On Thu, Dec 4, 2025 at 12:44 PM Robin Savonen Söderholm via Std-Proposals <std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>> wrote:
>
> So if I understand correctly, you want to avoid "returning to the caller" and rather just jump to the next "task" (something similar to, but not quite, a function call) in the hope that it would speed up the process? And you want to dynamically change whatever is next on the queue?
> If so, how does this differ from e.g. "goto"? And can we get in trouble because this sounds as if we will need to manually circumvent or handle things from the calling conventions? I wonder if we really can get that much performance from this proposal. Do you have a (platform-specific) example project that proves that your particular compiler extension can indeed give measurable performance improvements?
>
> // Robin
>
> On Thu, Dec 4, 2025, 07:43 Kamalesh Lakkampally via Std-Proposals <std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>> wrote:
>
> Hi Marc,
>
> Thank you for your comments. Since the mailing list strips attachments, you have not seen the core details of the idea, so let me restate the proposal from the beginning.
>
>
> 1. *What the proposal is NOT*
>
> It is *not*:
>
> * a new CPU instruction
>
> * a hardware prefetching mechanism
>
> * a cache hint
>
> * a pipeline control mechanism
>
> * an optimization directive
>
> * a request for compiler backend changes
>
> * a form of parallelism or GPU-style execution
>
> None of these describe the concept accurately.
>
> The proposal operates *purely at the C++ abstract-machine level*.
>
>
> 2. *What the proposal /is/: A new way to express dynamic sequencing of micro-tasks*
>
> Many event-driven systems maintain a queue of very small tasks ("micro-tasks") whose execution order changes /frequently/ at runtime:
>
> At one moment: T1 → T2 → T3
> Later: T3 → T4 → T1 → T2
>
> In C++ today, these systems must route control through a dispatcher:
>
> dispatcher();
> → T1();
> dispatcher();
> → T2();
> dispatcher();
> → T3();
>
> Even though the /intended/ program order is simply:
>
> T1 → T2 → T3
>
> This repeated dispatcher → task → dispatcher pattern:
>
> * is semantically unnecessary
>
> * consumes execution bandwidth
>
> * prevents compiler optimization
>
> * introduces unpredictable control flow
>
> * creates overhead for extremely fine-grained tasks
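> To make the status quo concrete, here is a minimal sketch of the pattern being described (names invented for illustration): control returns to a central queue between every micro-task, costing a return, a pop, and an indirect call per step.

```cpp
#include <queue>
#include <vector>

using MicroTask = void (*)(std::vector<int>&);

void T1(std::vector<int>& log) { log.push_back(1); }
void T2(std::vector<int>& log) { log.push_back(2); }
void T3(std::vector<int>& log) { log.push_back(3); }

// Today's dispatcher pattern: every task call is bracketed by a trip
// back through this loop, even when the order is already known.
std::vector<int> dispatcher(std::queue<MicroTask> tasks) {
    std::vector<int> log;
    while (!tasks.empty()) {
        MicroTask t = tasks.front();
        tasks.pop();
        t(log);      // T_i runs...
    }                // ...then control returns here before T_(i+1)
    return log;
}
```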
>
> The proposal asks:
>
> *Can C++ express dynamic sequencing declaratively, without requiring the program to re-enter the dispatcher between every micro-task?*
>
>
> 3. *The key idea: “Fetch-only operations”*
>
> A /fetch-only operation/ is a C++ semantic construct that:
>
>
> ✔ does NOT compute
> ✔ does NOT read or write memory
> ✔ does NOT have observable side effects
> ✔ does NOT correspond to a function call or branch
> ✔ EXISTS ONLY to describe “what executes next”
>
> In other words, it is a *pure sequencing directive*, not an instruction or computation.
>
> For example (placeholder syntax):
>
>     fad q[i] = next_address;
>     fcd q[i] = thread_context;
>     fed q[i] = exec_context;
>
> These operations place *sequencing metadata* into a dedicated structure.
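> One way to read the placeholder syntax as data (entirely my own library-level rendering, not the proposed semantics): each queue slot pairs an entry address with the two context bytes described in section 4.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout of one sequencing slot, mirroring the placeholders:
// fad = next address, fcd = thread context, fed = execution context.
struct FetchEntry {
    void (*next_address)();        // function entry to evaluate next
    std::uint8_t thread_context;   // "fcd"
    std::uint8_t exec_context;     // "fed"
};

// The fetch-only region modelled as an ordinary container for illustration;
// the proposal would make it implementation-managed, not open to ordinary
// loads and stores.
std::vector<FetchEntry> fetch_region;

void enqueue(void (*task)(), std::uint8_t tctx, std::uint8_t ectx) {
    fetch_region.push_back({task, tctx, ectx});
}
```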
>
>
> 4. *What metadata is being represented?*
>
> Each “micro-task” is associated with small context fields:
>
> * *8-bit thread-context*
>
> * *8-bit execution-context*
>
> * *an instruction address (or function entry)*
>
> These fields allow the program to encode:
>
> “After completing this task, the /next/ task to evaluate is at address X,
> but only if its context matches Y.”
>
> This enables the expression of dynamic scheduling decisions *without returning through the dispatcher*.
>
>
> 5. *Fetch-Only Region: where this metadata lives*
>
> Just as C++ programs conceptually have:
>
> * a *stack region* (automatic storage)
>
> * a *heap region* (dynamic storage)
>
> the proposal introduces a *fetch-only region*:
>
> * memory that stores sequencing metadata
>
> * strictly controlled by the implementation (for example, via the MMU, the memory management unit)
>
> * with context validation
>
> * not accessible for ordinary loads/stores
>
> * used only for sequencing, not computation
>
> This region is *not hardware-specific*; it is an abstract-machine concept, much like the thread-local storage model or atomic synchronization regions.
>
>
> 6. *Why this belongs at the language level*
>
> This is not expressible today because C++ has *no construct to describe sequencing separately from execution*.
>
> Existing mechanisms:
>
>
> * Threads / tasks / executors → require execution-path transitions
>
> * Coroutines → maintain suspended execution frames
>
> * Function calls → require call/return semantics
>
> * Dispatch loops → centralize sequencing and exhibit overhead
>
> * As-if rule → cannot remove dispatcher calls; they are semantically required
>
> Dynamic sequencing is fundamentally:
>
> * *not parallelism*,
>
> * *not computation*,
>
> * *not scheduling*,
>
> * *not hardware control*.
>
> It is *language-level intent*:
>
> “Evaluate micro-tasks in this dynamic order, without routing control back through a dispatcher.”
>
> This cannot be expressed with intrinsics, LLVM passes, or target-specific code.
>
>
> 7. *Why this is portable*
>
> Different compilers/runtimes/architectures may choose different implementations:
>
> * a pure software interpreter for sequencing
>
> * an optimizing JIT
>
> * a coroutine resumption graph
>
> * architecture-specific fast paths (optional)
>
> * or normal dispatch loops as a fallback
>
> The semantics remain:
>
> *Fetch-only operations provide sequencing metadata,
> the fetch-only region stores it, and
> execution proceeds in the declared order.*
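> As a sanity check that the fallback form is expressible today, here is a toy software interpreter over sequencing entries, including the context validation mentioned earlier. All names are invented; this is a sketch of one possible lowering, not the normative semantics.

```cpp
#include <cstdint>
#include <deque>

struct SeqEntry {
    int (*run)();                  // toy micro-task; returns its id
    std::uint8_t thread_context;   // must match the interpreter's context
};

int taskA() { return 1; }
int taskB() { return 2; }

// Walks the declared sequence directly; entries whose context does not
// match are rejected, a stand-in for "validation of allowed transitions".
// Returns how many entries actually executed.
int run_sequence(std::deque<SeqEntry> q, std::uint8_t ctx) {
    int executed = 0;
    while (!q.empty()) {
        SeqEntry e = q.front();
        q.pop_front();
        if (e.thread_context == ctx) {
            e.run();
            ++executed;
        }
    }
    return executed;
}
```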
>
> This is a valid addition to the C++ abstract machine, not a hardware feature.
>
>
> *Why context fields matter*
>
> Context metadata enables:
>
> * *correct sequencing:* ensuring that only valid successor tasks are chosen
>
> * *safety checks:* preventing unintended jumps to unrelated micro-tasks
>
> * *structural integrity:* maintaining a well-defined evaluation graph
>
>
> *Security aspect*
>
> Because the sequencing structure is stored in a dedicated *fetch-only region*—conceptually similar to the way the stack and heap represent distinct memory roles—the context fields also allow:
>
> * *validation of allowed transitions*,
>
> * *prevention of unauthorized or accidental modification*, and
>
> * *protection against control-flow corruption.*
>
> In other words, the combination of:
>
> * *context identifiers*, and
>
> * *a dedicated fetch-only region (analogous to stack/heap regions)*
>
> provides a framework in which implementations can enforce both correctness and security properties for dynamic sequencing.
>
> This occurs entirely at the semantic level; no explicit hardware behavior is assumed.
>
>
> 8. *Why the proposal exists*
>
> Highly dynamic micro-task workloads (event-driven simulation is just one example) cannot currently express their intent in C++ without:
>
> * repeated dispatcher calls
>
> * unnecessary control-flow redirections
>
> * significant overhead for fine-grained scheduling
>
> This proposal explores whether C++ can support such workloads in a portable and declarative manner.
>
>
> 9. *Still early-stage*
>
> This is an R0 exploratory draft. I am refining terminology, abstract-machine semantics, and examples with the help of feedback from this discussion.
>
>
>
> *Best Regards,
> Kamalesh Lakkampally,*
> *Founder & CEO*
> www.chipnadi.com <http://www.chipnadi.com>
>
>
>
> On Thu, Dec 4, 2025 at 11:49 AM Marc Edouard Gauthier via Std-Proposals <std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>> wrote:
>
> Kamalesh,
>
> It’s not clear that you’re looking to change anything at all in the language. If you are, you haven’t said exactly what it is.
>
> It seems more that you have an unusual, highly parallel hardware architecture, and that what you’re looking for is a very different compiler implementation, not a different or modified language.
>
> For example, in C++ or most any procedural language, you can write some sequence of independent steps:
>
>     a = b + c;
>     d = e * f;
>     g = 3 + 5 / h;
>
> The compiler (typically the compiler backend) is totally free to reorder or dispatch these in whatever way the underlying hardware architecture allows.
>
> If there are dependencies, such as, say, `k = a - 1;`, the compiler ensures operations are ordered to satisfy the dependency, in whatever way the hardware architecture allows for.
>
> So it seems already possible to express “micro-tasks”, whatever these might be, as simple independent C++ statements.
>
> Assuming you have a novel computer hardware architecture and you want compiler support for it, your time is very likely much better spent studying LLVM and how to port it to a new architecture than trying to propose things to this group without understanding what you’re proposing.
>
> You may also find different languages or language extensions (many supported by LLVM) that help target highly parallel hardware such as GPUs, and that perhaps better fit your hardware architecture.
>
> At least you might come out of that exercise with a much better understanding of the relationship between your hardware and compilers.
>
> (My 2 cents.)
>
> Marc
>
> *From:* Std-Proposals <std-proposals-bounces_at_[hidden] <mailto:std-proposals-bounces_at_[hidden]>> *On Behalf Of* Kamalesh Lakkampally via Std-Proposals
> *Sent:* Wednesday, December 3, 2025 21:46
> *To:* std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>
> *Cc:* Kamalesh Lakkampally <info_at_[hidden] <mailto:info_at_[hidden]>>
> *Subject:* Re: [std-proposals] Core-Language Extension for Fetch-Only Instruction Semantics
>
>
> Hello Thiago,
>
> Thank you for the thoughtful question — it touches on the central issue of whether this idea belongs in the C++ Standard or should remain an architecture-specific extension.
> Let me clarify the motivation more precisely, because the original message did not convey the full context.
>
>
> *1. This proposal is */not/* about prefetching or cache-control*
>
> The draft text unfortunately used terminology that sounded hardware-adjacent, but the intent is /not/ to introduce anything analogous to:
>
> * prefetch intrinsics
> * non-temporal loads/stores
> * cache control hints
> * pipeline fetch controls
>
> Those are ISA-level optimizations and naturally belong in compiler extensions or architecture-specific intrinsics.
>
> The concept here is completely different and exists at the *semantic level*, not the CPU-microarchitecture level.
>
>
> *2. The actual concept: explicit sequencing of dynamically changing micro-tasks*
>
> Many event-driven systems (HDL simulators are just one example) share a common execution model:
>
> * Thousands of /micro-tasks/
> * The order of tasks is computed dynamically
> * The order changes every cycle or even more frequently
> * Tasks themselves are small and often side-effect-free
> * A dispatcher function repeatedly selects the next task to run
>
> In C++ today, this typically results in a control-flow pattern like:
>
> dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …
>
> even when the /intended evaluation sequence/ is conceptually:
>
> T1 → T2 → T3 → T4 → …
>
> The key issue is *expressiveness*: C++ currently has no mechanism to express “evaluate these things in this dynamic order, without re-entering a dispatcher between each step”.
>
> Coroutines, tasks, executors, thread pools, and dispatch loops all still fundamentally operate through:
>
> * repeated function calls, or
> * repeated returns to a central controller, or
> * runtime-managed schedulers
>
> which means that as programs scale down to extremely fine-grained tasks, *sequencing overhead becomes dominant*, even on architectures where prefetching is not a concern.
>
>
> *3. Why this is not architecture-specific*
>
> The misunderstanding arises when “fetch-only operation” sounds like a CPU fetch-stage mechanism.
>
> The actual idea is:
>
> A mechanism in the abstract machine that allows a program to express a /sequence of evaluations/ that does not pass through a central dispatcher after each step.
>
> This can be implemented portably in many ways:
>
> * Using a software-run interpreter that consumes the sequencing structure
> * Using an implementation-specific optimization when available
> * Mapping to architecture-specific controls on processors that support such mechanisms
> * Or completely lowering into normal calls on simpler hardware
>
> This is similar in spirit to:
>
> * *coroutines*: abstract-machine semantics, many possible implementations
> * *atomics*: abstract semantics, which each architecture maps to whatever operations it has
> * *SIMD types*: portable semantics, mapped to different instructions per architecture
> * *executors*: abstract relationships, many possible backends
>
> So the goal is not to standardize “fetch-stage hardware instructions”,
> but to explore whether C++ can expose a *portable semantic form of sequencing that compilers may optimize very differently depending on platform capabilities*.
>
>
> *4. Why prefetch intrinsics are insufficient*
>
> Prefetch intrinsics provide:
>
> * data locality hints
> * non-temporal load/store hints
> * per-architecture micro-optimizations
>
> They do *not* provide:
>
> * a semantic representation of evaluation order
> * a way to represent a dynamically computed schedule
> * a portable abstraction
> * a way to eliminate dispatcher re-entry in the program model
>
> Prefetching does not remove the repeated calls/returns between tasks.
> This proposal focuses on /expressing intent/, not on cache behavior.
>
>
> *5. Why this might belong in the Standard*
>
> Because the idea is:
>
> * semantic, not architectural
> * portable across CPU, GPU, and FPGA-style systems
> * potentially optimizable by compilers
> * relevant to domains beyond HDL tools
> * conceptually related to execution agents, coroutines, and sequencing in executors
> * about exposing user intent (dynamic ordering), not hardware control
>
> Many modern workloads — including simulation, actor frameworks, reactive graph engines, and fine-grained schedulers — could benefit from a portable way to express:
>
> “Here is the next unit of evaluation; no need to return to a dispatcher.”
>
> even if the implementation varies drastically per target platform.
>
>
> *6. Early-stage nature*
>
> This is still an early R0 exploratory draft.
> I fully expect that the idea will require:
>
> * reframing in abstract-machine terminology
> * better examples
> * clarification of how sequencing is expressed
> * exploration of implementability on real compilers
>
> I appreciate your question because it helps anchor the discussion in the right conceptual layer.
>
> Thank you again for engaging — your perspective is extremely valuable.
>
> On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via Std-Proposals <std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>> wrote:
>
> On Wednesday, 3 December 2025 03:38:57 Pacific Standard Time Kamalesh
> Lakkampally via Std-Proposals wrote:
> > The goal is to support workloads where the execution order of micro-tasks
> > changes dynamically and unpredictably every cycle, such as *event-driven
> > HDL/SystemVerilog simulation*.
> > In such environments, conventional C++ mechanisms (threads, coroutines,
> > futures, indirect calls, executors) incur significant pipeline redirection
> > penalties. Fetch-only instructions aim to address this problem in a
> > structured, language-visible way.
> >
> > I would greatly appreciate *feedback, criticism, and suggestions* from the
> > community.
> > I am also *open to collaboration.*
>
> Can you explain why this should be in the Standard? Why are the prefetch
> intrinsics available as compiler extensions for a lot of architectures not
> enough? My first reaction is that this type of design is going to be very
> architecture-specific by definition, so using architecture-specific extensions
> should not be an impediment.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Data Center - Platform & Sys. Eng.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden] <mailto:Std-Proposals_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
>
>

Received on 2025-12-04 13:09:08