Date: Thu, 4 Dec 2025 18:52:54 +0530
Hello All,
Today, in standard C++, the control-flow structure of a function is
always *statically
bounded*.
Even if there are multiple possible branches, all possible successor
functions are known at compile-time.
For example:
function1 → function2 → function3
                     ↘ function4
Here the compiler knows that function2 can transfer control only to:
- function3, or
- function4.
This is a fundamental assumption behind C++:
*All possible control-flow destinations are statically known, even if the
choice is dynamic.*
But in my case, the successor is not statically bounded.
The model I’m working with looks like this:
function1 → function2 → (next function is chosen at runtime)
At runtime, *not at compile time*, the program may write the address of:
- function3, or
- function4, or
- functionN, or
- any function from a dynamically created set.
In other words:
*The set of possible successors is not known statically. It is constructed
at runtime and may vary extremely frequently.*
This is very different from C++’s normal branching model, where all
successors are part of the program’s fixed control-flow graph.
*1. “Fetch-only” is indeed the wrong name*
I agree.
The term brought in the wrong mental model, and it suggested architectural
behavior that is not part of the proposal.
A much better framing is:
*a declarative sequencing construct (a form of novel control flow)*
that specifies *what should run next*,
without executing computation itself and without the semantics of a
function call.
I’ll adopt clearer terminology going forward.
*2. Why not goto, longjmp, or fibers?*
Yes, the proposal tries to describe cross-function control transfer, but:
- goto is intra-function only
- longjmp bypasses stack frames, destroys local state, and is control-flow
destructive
- fibers/coroutines package execution state and resume points, not *pure
sequencing metadata*
The key distinction is:
*I am not trying to transfer execution state; I am trying to express an
evaluation schedule that is constructed at runtime.*
This is an important difference:
- coroutines resume captured execution stacks
- longjmp unwinds stacks
- goto jumps within a function
But I want to express:
“Task T2 should run next, because scheduling metadata says so —
not because T2 is lexically next or a continuation of T1.”
It is closer to a *runtime-constructed control-flow graph*, not a call
stack manipulation.
*3. The 8-bit context mention was a mistake*
You are absolutely right — fixed-width numbers do not belong in a proposal
about abstract-machine semantics.
The correct description should be:
*Tasks may optionally be annotated with small, implementation-defined
context identifiers that sequencing logic can match against.*
Their width, representation, and meaning would be entirely
implementation-defined.
I should not have written “8-bit” and will correct that.
*4. What does “declarative sequencing” actually mean?*
By “declarative,” I mean:
The program provides *metadata describing which task should run next*,
while the implementation chooses *how* to realize that sequence.
So dynamic influence is allowed — and intended — but the sequencing
construct itself:
- does not compute
- does not load or store
- does not participate in side effects
- only names the next evaluation target
This is separate from the execution of the tasks themselves.
*5. Yes, tasks may insert new tasks into the schedule*
This is central to the use case.
To answer your question:
*Can a task T1 insert T5 into the schedule?*
Yes — that is exactly the model.
The sequence is *not* fixed; it is *runtime-constructed and frequently
updated*.
This is what distinguishes it from unrolled loops or statically generated
dispatch chains.
The metadata structure effectively becomes:
- a runtime-modifiable list, or
- a small graph, or
- a chain of successor relationships
built out of ordinary C++ logic.
The proposal is *not* about the container; it is about giving the
implementation freedom to run that schedule without repeatedly routing
through a dispatcher.
*6. Where this differs from existing mechanisms*
Today C++ implements dynamic scheduling with:
while (!q.empty()) {
    auto t = q.front();  // fetch the next task
    q.pop();
    t();                 // ordinary function call via the dispatcher
}
But this carries:
- call/return overhead
- mandatory ABI boundaries
- loss of optimization opportunities (opaque function pointer calls)
- repeated re-entry into the dispatcher
What I want to explore is whether C++ could have:
*a side-effect-free construct that expresses the sequence itself*
giving implementations permission to execute the sequence more directly.
This is closer to an *explicit, runtime-defined control-flow path*,
not a request for new hardware or new stack semantics.
*Best Regards,
Kamalesh Lakkampally,*
*Founder & CEO*
www.chipnadi.com
On Thu, Dec 4, 2025 at 6:39 PM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
> First of all, calling this a "fetch-only" thing is a huge misnomer
> for me. There are lots of things in C++ that influence control flow
> only, but don't (e.g.) modify state or similar. Such as a while loop.
> Call it "novel control flow" if you wish.
>
> You've addressed the question why "goto" doesn't work for you
> (need cross-function), so maybe "longjmp" is a better model for
> you? We also have stateful coroutines in the making (look for
> "fibers"), which might help you.
>
> Talking about "8-bit" in a proposal purely about the abstract machine
> is entirely unhelpful; we don't specify the bit-width of an "int",
> and we certainly should not specify the bit-width of something that
> can hold a thread context handle. (I have more than 256 threads,
> for sure.)
>
> You're talking about "declarative", but I don't understand how
> much dynamic influence you want to have.
>
> Can a task T1 enter another task T5 into the queue, for later
> execution? If not, how do those tasks come into being in
> the first place?
>
> Jens
>
>
>
> On 12/4/25 09:15, Kamalesh Lakkampally via Std-Proposals wrote:
> > Hi Robin,
> >
> > Thank you, that’s a very helpful way to frame the questions.
> >
> >
> > 1. What you understood correctly
> >
> > Yes, at a high level you understood the intention:
> >
> > *
> >
> > I want to avoid repeatedly “returning to the dispatcher” (or central
> caller) between micro-tasks.
> >
> > *
> >
> > I want the program to be able to describe a sequence of tasks that
> can change dynamically at runtime, and then have execution proceed along
> that sequence directly.
> >
> > So instead of:
> >
> > dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → ...
> >
> > the /intent/ is to express:
> > T1 → T2 → T3 → ...
> >
> >
> > where the exact order is computed dynamically.
> >
> >
> > 2. Why this is not just |goto|
> >
> > A |goto| in C++:
> >
> > *
> >
> > is *local to a function*
> >
> > *
> >
> > jumps to a label in the /same/ function
> >
> > *
> >
> > still lives entirely within the usual call/return and stack
> discipline
> >
> > *
> >
> > does not encode any higher-level scheduling or context
> >
> > What I am exploring is /not/ a new low-level transfer like |goto|, but a
> way to:
> >
> > *
> >
> > build a *sequence of tasks* (potentially different functions or
> micro-tasks),
> >
> > *
> >
> > optionally annotated with small context identifiers,
> >
> > *
> >
> > that a runtime can then walk through as the next evaluation steps.
> >
> > So it’s closer to:
> >
> > *
> >
> > “a declarative schedule of what to run next”
> >
> > than to:
> >
> > *
> >
> > “a raw jump from one instruction to another.”
> >
> > The semantics I have in mind *do not replace* C++’s existing call/return
> or stack behavior; they add a separate, higher-level description of
> evaluation order.
> >
> >
> > 3. Calling conventions and ABI
> >
> > You are absolutely right that anything that /directly/ bypasses
> call/return and ABI would be very problematic.
> >
> > To be clear: I am *not* proposing that C++ programmers manually
> circumvent calling conventions or stack protocols. Any eventual
> implementation in a conforming C++ implementation would still:
> >
> > *
> >
> > obey the platform ABI,
> >
> > *
> >
> > use normal calls/returns or coroutine resumption under the hood, or
> >
> > *
> >
> > use some runtime mechanism that remains conforming.
> >
> > The proposal is about *expressing the sequencing intent in the
> language/abstract machine*, not about specifying a particular low-level
> jump mechanism. How that sequencing is lowered (calls, trampolines, state
> machines, etc.) would be implementation-defined.
> >
> >
> > 4. Performance evidence (honest status)
> >
> > Right now, this is at the /research / exploratory/ stage:
> >
> > *
> >
> > The idea is motivated by internal experiments in event-driven
> simulation workloads, where dispatcher-style control-flow becomes a
> bottleneck.
> >
> > *
> >
> > I do *not* yet have a standardized, portable C++ compiler extension
> that we can benchmark across architectures.
> >
> > *
> >
> > I agree that for this to move beyond “interesting idea”, we will
> need at least one concrete prototype and some performance data.
> >
> > So in that sense, you are absolutely right to ask for measurements: they
> will be essential before any serious standardization attempt. At this point
> I am trying to:
> >
> > *
> >
> > check whether the conceptual direction makes sense to the committee,
> and
> >
> > *
> >
> > learn how to better express it in abstract-machine terms,
> >
> > *
> >
> > before investing in a more complete implementation and measurement
> campaign.
> >
> > Thanks again for the questions — they help me refine both the technical
> and explanatory sides of the proposal.
> >
> >
> >
> > *Best Regards,
> > Kamalesh Lakkampally,*
> > *Founder & CEO*
> > www.chipnadi.com <http://www.chipnadi.com>
> >
> >
> >
> > On Thu, Dec 4, 2025 at 12:44 PM Robin Savonen Söderholm via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:
> >
> > So if I understand correctly, you want to avoid "returning to the
> callee" and rather just jump to the next "task" (as something similar but
> not quite a function call) in hope that it would speed up the process? And
> you want to dynamically change what ever is next on the queue?
> > If so, how does this differ from e.g. "goto"? And can we get in
> trouble because this sounds as we will need to circumvent/handle manually
> things from the calling conventions? I wonder if we really can get so much
> performance from this proposal. Do you gave a (platform specific) example
> project that proves that your particular compiler extension indeed can give
> measurable performance improvements?
> >
> > // Robin
> >
> > On Thu, Dec 4, 2025, 07:43 Kamalesh Lakkampally via Std-Proposals <
> std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>>
> wrote:
> >
> > Hi Marc,
> >
> > Thank you for your comments. Since the mailing list strips
> attachments, you have not seen the core details of the idea, so let me
> restate the proposal from the beginning.
> >
> >
> > 1. *What the proposal is NOT*
> >
> > It is *not*:
> >
> > *
> >
> > a new CPU instruction
> >
> > *
> >
> > a hardware prefetching mechanism
> >
> > *
> >
> > a cache hint
> >
> > *
> >
> > a pipeline control mechanism
> >
> > *
> >
> > an optimization directive
> >
> > *
> >
> > a request for compiler backend changes
> >
> > *
> >
> > a form of parallelism or GPU-style execution
> >
> > None of these describe the concept accurately.
> >
> > The proposal operates *purely at the C++ abstract-machine level*.
> >
> >
> > 2. *What the proposal /is/: A new way to express dynamic
> sequencing of micro-tasks*
> >
> > Many event-driven systems maintain a queue of very small tasks
> ("micro-tasks") whose execution order changes /frequently/ at runtime:
> >
> > At one moment: T1 → T2 → T3
> > Later: T3 → T4 → T1 → T2
> >
> > In C++ today, these systems must route control through a
> dispatcher:
> >
> > dispatcher();
> > → T1();
> > dispatcher();
> > → T2();
> > dispatcher();
> > → T3();
> >
> > Even though the /intended/program order is simply:
> >
> > T1 → T2 → T3
> >
> > This repeated dispatcher → task → dispatcher pattern:
> >
> > *
> >
> > is semantically unnecessary
> >
> > *
> >
> > consumes execution bandwidth
> >
> > *
> >
> > prevents compiler optimization
> >
> > *
> >
> > introduces unpredictable control flow
> >
> > *
> >
> > creates overhead for extremely fine-grained tasks
> >
> > The proposal asks:
> >
> > *Can C++ express dynamic sequencing declaratively, without
> requiring the program to re-enter the dispatcher between every micro-task?*
> >
> >
> > 3. *The key idea: “Fetch-only operations”*
> >
> > A /fetch-only operation/ is a C++ semantic construct that:
> >
> >
> > ✔ does NOT compute
> >
> >
> > ✔ does NOT read or write memory
> >
> >
> > ✔ does NOT have observable side effects
> >
> >
> > ✔ does NOT correspond to a function call or branch
> >
> >
> > ✔ EXISTS ONLY to describe “what executes next”
> >
> > In other words, it is a *pure sequencing directive*, not an
> instruction or computation.
> >
> > For example (placeholder syntax):
> >
> > |fad q[i] = next_address; fcd q[i] = thread_context; fed q[i] =
> exec_context; |
> >
> > These operations place *sequencing metadata*into a dedicated
> structure.
> >
> >
> > 4. *What metadata is being represented?*
> >
> > Each “micro-task” is associated with small context fields:
> >
> > *
> >
> > *8-bit thread-context*
> >
> > *
> >
> > *8-bit execution-context*
> >
> > *
> >
> > *an instruction address (or function entry)*
> >
> > These fields allow the program to encode:
> >
> > “After completing this task, the /next/ task to evaluate is
> at address X,
> > but only if its context matches Y.”
> >
> > This enables the expression of dynamic scheduling decisions
> *without returning through the dispatcher*.
> >
> >
> > 5. *Fetch-Only Region: where this metadata lives*
> >
> > Just as C++ programs conceptually have:
> >
> > *
> >
> > a *stack region* (automatic storage)
> >
> > *
> >
> > a *heap region* (dynamic storage)
> >
> > the proposal introduces a *fetch-only region*:
> >
> > *
> >
> > memory that stores sequencing metadata
> >
> > *
> >
> > strictly controlled by the implementation--> MMU(memory
> management unit)
> >
> > *
> >
> > with context validation
> >
> > *
> >
> > not accessible for ordinary loads/stores
> >
> > *
> >
> > used only for sequencing, not computation
> >
> > This region is *not hardware-specific*; it is an
> abstract-machine concept, much like the thread-local storage model or
> atomic synchronization regions.
> >
> >
> > 6. *Why this belongs at the language level*
> >
> > This is not expressible today because C++ has *no construct to
> describe sequencing separately from execution*.
> >
> > Existing mechanisms:
> >
> >
> > Threads / tasks / executors
> >
> > → Require execution-path transitions
> >
> >
> > Coroutines
> >
> > → Maintain suspended execution frames
> >
> >
> > Function calls
> >
> > → Require call/return semantics
> >
> >
> > Dispatch loops
> >
> > → Centralize sequencing and exhibit overhead
> >
> >
> > As-if rule
> >
> > → Cannot remove dispatcher calls; they are semantically required
> >
> > Dynamic sequencing is fundamentally:
> >
> > *
> >
> > *not parallelism*,
> >
> > *
> >
> > *not computation*,
> >
> > *
> >
> > *not scheduling*,
> >
> > *
> >
> > *not hardware control*.
> >
> > It is *language-level intent*:
> >
> > “Evaluate micro-tasks in this dynamic order, without routing
> control back through a dispatcher.”
> >
> > This cannot be expressed with intrinsics, LLVM passes, or
> target-specific code.
> >
> >
> > 7. *Why this is portable*
> >
> > Different compilers/runtimes/architectures may choose different
> implementations:
> >
> > *
> >
> > pure software interpreter for sequencing
> >
> > *
> >
> > optimizing JIT
> >
> > *
> >
> > coroutine resumption graph
> >
> > *
> >
> > architecture-specific fast paths (optional)
> >
> > *
> >
> > or normal dispatch loops as fallback
> >
> > The semantics remain:
> >
> > *Fetch-only operations provide sequencing metadata
> > fetch-only region stores it
> > execution proceeds in the declared order.*
> >
> > This is a valid addition to the C++ abstract machine, not a
> hardware feature.
> >
> >
> > *Why context fields matter*
> >
> > Context metadata enables:
> >
> > *
> >
> > *correct sequencing:* ensuring that only valid successor
> tasks are chosen
> >
> > *
> >
> > *safety checks:* preventing unintended jumps to unrelated
> micro-tasks
> >
> > *
> >
> > *structural integrity:* maintaining a well-defined
> evaluation graph
> >
> >
> > *Security aspect*
> >
> > Because the sequencing structure is stored in a dedicated
> *fetch-only region*—conceptually similar to the way the stack and heap
> represent distinct memory roles—the context fields also allow:
> >
> > *
> >
> > *validation of allowed transitions*,
> >
> > *
> >
> > *prevention of unauthorized or accidental modification*, and
> >
> > *
> >
> > *protection against control-flow corruption.*
> >
> > In other words, the combination of:
> >
> > *
> >
> > *context identifiers*, and
> >
> > *
> >
> > *a dedicated fetch-only region (analogous to stack/heap
> regions)*
> >
> > provides a framework in which implementations can enforce both
> correctness and security properties for dynamic sequencing.
> >
> > This occurs entirely at the semantic level; no explicit hardware
> behavior is assumed.
> >
> >
> > 8. *Why the proposal exists*
> >
> > Highly dynamic micro-task workloads (event-driven simulation is
> just one example) cannot currently express their intent in C++ without:
> >
> > *
> >
> > repeated dispatcher calls
> >
> > *
> >
> > unnecessary control-flow redirections
> >
> > *
> >
> > significant overhead for fine-grained scheduling
> >
> > This proposal explores whether C++ can support such workloads in
> a portable and declarative manner.
> >
> >
> > 9. *Still early-stage*
> >
> > This is an R0 exploratory draft. I am refining terminology,
> abstract-machine semantics, and examples with the help of feedback from
> this discussion.
> >
> >
> >
> > *Best Regards,
> > Kamalesh Lakkampally,*
> > *Founder & CEO*
> > www.chipnadi.com <http://www.chipnadi.com>
> >
> >
> >
> > On Thu, Dec 4, 2025 at 11:49 AM Marc Edouard Gauthier via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:
> >
> > Kamalesh,
> >
> > It’s not clear that you’re looking to change anything at all
> > in the language. If you are, you haven’t said exactly what it is.
> >
> > It seems more that you have an unusual highly parallel
> > hardware architecture, and that what you’re looking for is a very different
> > compiler implementation, not a different or modified language.
> >
> > For example, in C++ or most any procedural language, you can
> > write some sequence of independent steps:
> >
> > a = b + c;
> > d = e * f;
> > g = 3 + 5 / h;
> >
> > The compiler (compiler backend typically) is totally free to
> > reorder these, or dispatch these, in whatever way the underlying hardware
> > architecture allows.
> >
> > If there are dependencies, such as say, `k = a – 1;` the
> > compiler ensures operations are ordered to satisfy the dependency, in
> > whatever way the hardware architecture allows for.
> >
> > So it seems already possible to express “micro-tasks”,
> > whatever these might be, as simple independent C++ statements.
> >
> > Assuming you have a novel computer hardware architecture,
> > and you want compiler support for it, your time is very likely much better
> > spent studying LLVM and how to port it to a new architecture, than trying
> > to propose things to this group without understanding what you’re
> > proposing.
> >
> > You may also find different languages or language extensions
> > (many supported by LLVM) that help targeting highly parallel hardware such
> > as GPUs, that perhaps better fit your hardware architecture.
> >
> > At least you might come out of that exercise with much
> > better understanding of the relationship between your hardware and
> > compilers.
> >
> > (My 2 cents.)
> >
> > Marc
> >
> >
> > *From:*Std-Proposals <std-proposals-bounces_at_[hidden]
> <mailto:std-proposals-bounces_at_[hidden]>> *On Behalf Of *Kamalesh
> Lakkampally via Std-Proposals
> > *Sent:* Wednesday, December 3, 2025 21:46
> > *To:* std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>
> > *Cc:* Kamalesh Lakkampally <info_at_[hidden] <mailto:
> info_at_[hidden]>>
> > *Subject:* Re: [std-proposals] Core-Language Extension for
> Fetch-Only Instruction Semantics
> >
> >
> > Hello Thiago,
> >
> > Thank you for the thoughtful question — it touches on the
> > central issue of whether this idea belongs in the C++ Standard or should
> > remain an architecture-specific extension.
> > Let me clarify the motivation more precisely, because the
> > original message did not convey the full context.
> >
> >
> > *1. This proposal is */not/* about prefetching or
> > cache-control*
> >
> > The draft text unfortunately used terminology that sounded
> > hardware-adjacent, but the intent is /not/ to introduce anything analogous
> > to:
> >
> > * prefetch intrinsics
> > * non-temporal loads/stores
> > * cache control hints
> > * pipeline fetch controls
> >
> > Those are ISA-level optimizations and naturally belong in
> > compiler extensions or architecture-specific intrinsics.
> >
> > The concept here is completely different and exists at the
> > *semantic level*, not the CPU-microarchitecture level.
> >
> >
> >
> > *2. The actual concept: explicit sequencing of
> > dynamically changing micro-tasks*
> >
> > Many event-driven systems (HDL simulators are just one
> > example) share a common execution model:
> >
> > * Thousands of /micro-tasks/
> > * The order of tasks is computed dynamically
> > * The order changes every cycle or even more frequently
> > * Tasks themselves are small and often side-effect-free
> > * A dispatcher function repeatedly selects the next task
> > to run
> >
> > In C++ today, this typically results in a control-flow
> > pattern like:
> >
> > dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …
> >
> > even when the /intended evaluation sequence/ is
> > conceptually:
> >
> > T1 → T2 → T3 → T4 → …
> >
> > The key issue is *expressiveness*: C++ currently has no
> > mechanism to express “evaluate these things in this dynamic order, without
> > re-entering a dispatcher between each step”.
> >
> > Coroutines, tasks, executors, thread pools, and dispatch
> > loops all still fundamentally operate through:
> >
> > * repeated function calls, or
> > * repeated returns to a central controller, or
> > * runtime-managed schedulers
> >
> > which means that as programs scale down to extremely
> > fine-grained tasks, *sequencing overhead becomes dominant*, even on
> > architectures where prefetching is not a concern.
> >
> >
> >
> > *3. Why this is not architecture-specific*
> >
> > The misunderstanding arises when “fetch-only operation”
> > sounds like a CPU fetch-stage mechanism.
> >
> > The actual idea is:
> >
> > A mechanism in the abstract machine that allows a
> > program to express a /sequence of evaluations/ that does not pass through a
> > central dispatcher after each step.
> >
> > This can be implemented portably in many ways:
> >
> > * Using a software-run interpreter that consumes the
> > sequencing structure
> > * Using an implementation-specific optimization when
> > available
> > * Mapping to architecture-specific controls on processors
> > that support such mechanisms
> > * Or completely lowering into normal calls on simpler
> > hardware
> >
> > This is similar in spirit to:
> >
> > * *coroutines*: abstract machine semantics, many possible
> > implementations
> > * *atomics*: abstract semantics, architecture maps them to
> > whatever operations it has
> > * *SIMD types*: portable semantics, mapped to different
> > instructions per architecture
> > * *executors*: abstract relationships, many possible
> > backends
> >
> > So the goal is not to standardize “fetch-stage hardware
> > instructions”,
> > but to explore whether C++ can expose a *portable semantic
> > form of sequencing that compilers may optimize very differently depending
> > on platform capabilities*.
> >
> >
> >
> > *4. Why prefetch intrinsics are insufficient*
> >
> > Prefetch intrinsics provide:
> >
> > * data locality hints
> > * non-temporal load/store hints
> > * per-architecture micro-optimizations
> >
> > They do *not* provide:
> >
> > * a semantic representation of evaluation order
> > * a way to represent a dynamically computed schedule
> > * a portable abstraction
> > * a way to eliminate dispatcher re-entry in the program
> > model
> >
> > Prefetching does not remove the repeated calls/returns
> > between tasks.
> > This proposal focuses on /expressing intent/, not on cache
> > behavior.
> >
> >
> > *5. Why this might belong in the Standard*
> >
> > Because the idea is:
> >
> > * semantic, not architectural
> > * portable across CPU, GPU, and FPGA-style systems
> > * potentially optimizable by compilers
> > * relevant to domains beyond HDL tools
> > * conceptually related to execution agents, coroutines,
> > and sequencing in executors
> > * about exposing user intent (dynamic ordering), not
> > hardware control
> >
> > Many modern workloads — including simulation, actor
> > frameworks, reactive graph engines, and fine-grained schedulers — could
> > benefit from a portable way to express:
> >
> > “Here is the next unit of evaluation; no need to return
> > to a dispatcher.”
> >
> > even if the implementation varies drastically per target
> > platform.
> >
> >
> > *6. Early-stage nature*
> >
> > This is still an early R0 exploratory draft.
> > I fully expect that the idea will require:
> >
> > * reframing in abstract-machine terminology
> > * better examples
> > * clarification of how sequencing is expressed
> > * exploration of implementability on real compilers
> >
> > I appreciate your question because it helps anchor the
> > discussion in the right conceptual layer.
> >
> > Thank you again for engaging — your perspective is extremely
> > valuable.
> >
> >
> > On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:
> >
> > On Wednesday, 3 December 2025 03:38:57 Pacific Standard
> Time Kamalesh
> > Lakkampally via Std-Proposals wrote:
> > > The goal is to support workloads where the execution
> order of micro-tasks
> > > changes dynamically and unpredictably every cycle,
> such as *event-driven
> > > HDL/SystemVerilog simulation*.
> > > In such environments, conventional C++ mechanisms
> (threads, coroutines,
> > > futures, indirect calls, executors) incur significant
> pipeline redirection
> > > penalties. Fetch-only instructions aim to address this
> problem in a
> > > structured, language-visible way.
> > >
> > > I would greatly appreciate *feedback, criticism, and
> suggestions* from the
> > > community.
> > > I am also *open to collaboration.*
> >
> > Can you explain why this should be in the Standard? Why
> are the prefetch
> > intrinsics available as compiler extensions for a lot of
> architectures not
> > enough? My first reaction is that this type of design is
> going to be very
> > architecture-specific by definition, so using
> architecture-specific extensions
> > should not be an impediment.
> >
> > --
> > Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > Principal Engineer - Intel Data Center - Platform &
> Sys. Eng.
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> >
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> <https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals <
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals <
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> >
>
>
Today, in standard C++, the control-flow structure of a function is
always *statically
bounded*.
Even if there are multiple possible branches, all possible successor
functions are known at compile-time.
For example:
function1 → function2 → function3 ↘ function4
Here the compiler knows that function2 can transfer control only to:
-
function3, or
-
function4.
This is a fundamental assumption behind C++:
*All possible control-flow destinations are statically known, even if the
choice is dynamic.*
But in my case , successor is not statically bounded .
The model I’m working with looks like this:
function1 → function2 → (next function is chosen at runtime)
At runtime, *not at compile time*, the program may write the address of:
-
function3, or
-
function4, or
-
functionN, or
-
any function from a dynamically created set.
In other words:
*The set of possible successors is not known statically.It is constructed
at runtime and may vary extremely frequently.*
This is very different from C++’s normal branching model, where all
successors are part of the program’s fixed control-flow graph.
*1. “Fetch-only” is indeed the wrong name*
I agree.
The term brought in the wrong mental model, and it suggested architectural
behavior that is not part of the proposal.
A much better framing is:
*a declarative sequencing construct (a form of novel control flow)*
that specifies *what should run next*,
without executing computation itself and without the semantics of a
function call.
I’ll adopt clearer terminology going forward.
*2. Why not goto, longjmp, or fibers?*
Yes, the proposal tries to describe cross-function control transfer, but:
-
goto is intra-function only
-
longjmp bypasses stack frames, destroys local state, and is control-flow
destructive
-
fibers/coroutines package execution state and resume points, not *pure
sequencing metadata*
The key distinction is:
*I am not trying to transfer execution state; I am trying to express an
evaluation schedule that is constructed at runtime.*
This is an important difference:
-
coroutines resume captured execution stacks
-
longjmp unwinds stacks
-
goto jumps within a function
But I want to express:
“Task T2 should run next, because scheduling metadata says so —
not because T2 is lexically next or a continuation of T1.”
It is closer to a *runtime-constructed control-flow graph*, not a call
stack manipulation.
*3. The 8-bit context mention was a mistake*
You are absolutely right — fixed-width numbers do not belong in a proposal
about abstract-machine semantics.
The correct description should be:
*Tasks may optionally be annotated with small, implementation-defined
context identifiers that sequencing logic can match against.*
Their width, representation, and meaning would be entirely
implementation-defined.
I should not have written “8-bit” and will correct that.
*4. What does “declarative sequencing” actually mean?*
By “declarative,” I mean:
The program provides *metadata describing which task should run next*,
while the implementation chooses *how* to realize that sequence.
So dynamic influence is allowed — and intended — but the sequencing
construct itself:
- does not compute
- does not load or store
- does not participate in side effects
- only names the next evaluation target
This is separate from the execution of the tasks themselves.
*5. Yes, tasks may insert new tasks into the schedule*
This is central to the use case.
To answer your question:
*Can a task T1 insert T5 into the schedule?*
Yes — that is exactly the model.
The sequence is *not* fixed; it is *runtime-constructed and frequently
updated*.
This is what distinguishes it from unrolled loops or statically generated
dispatch chains.
The metadata structure effectively becomes:
- a runtime-modifiable list, or
- a small graph, or
- a chain of successor relationships
built out of ordinary C++ logic.
The proposal is *not* about the container; it is about giving the
implementation freedom to run that schedule without repeatedly routing
through a dispatcher.
*6. Where this differs from existing mechanisms*
Today C++ implements dynamic scheduling with:
std::queue<std::function<void()>> q;
while (!q.empty()) {
    auto t = std::move(q.front());
    q.pop();
    t(); // ordinary indirect call, routed through the dispatcher
}
But this carries:
- call/return overhead
- mandatory ABI boundaries
- loss of optimization opportunities (opaque function pointer calls)
- repeated re-entry into the dispatcher
What I want to explore is whether C++ could have:
*a side-effect-free construct that expresses the sequence itself*
giving implementations permission to execute the sequence more directly.
This is closer to an *explicit, runtime-defined control-flow path*,
not a request for new hardware or new stack semantics.
*Best Regards,
Kamalesh Lakkampally,*
*Founder & CEO*
www.chipnadi.com
On Thu, Dec 4, 2025 at 6:39 PM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
> First of all, calling this a "fetch-only" thing is a huge misnomer
> for me. There are lots of things in C++ that influence control flow
> only, but don't (e.g.) modify state or similar. Such as a while loop.
> Call it "novel control flow" if you wish.
>
> You've addressed the question why "goto" doesn't work for you
> (need cross-function), so maybe "longjmp" is a better model for
> you? We also have stateful coroutines in the making (look for
> "fibers"), which might help you.
>
> Talking about "8-bit" in a proposal purely about the abstract machine
> is entirely unhelpful; we don't specify the bit-width of an "int",
> and we certainly should not specify the bit-width of something that
> can hold a thread context handle. (I have more than 256 threads,
> for sure.)
>
> You're talking about "declarative", but I don't understand how
> much dynamic influence you want to have.
>
> Can a task T1 enter another task T5 into the queue, for later
> execution? If not, how do those tasks come into being in
> the first place?
>
> Jens
>
>
>
> On 12/4/25 09:15, Kamalesh Lakkampally via Std-Proposals wrote:
> > Hi Robin,
> >
> > Thank you, that’s a very helpful way to frame the questions.
> >
> >
> > 1. What you understood correctly
> >
> > Yes, at a high level you understood the intention:
> >
> > *
> >
> > I want to avoid repeatedly “returning to the dispatcher” (or central
> caller) between micro-tasks.
> >
> > *
> >
> > I want the program to be able to describe a sequence of tasks that
> can change dynamically at runtime, and then have execution proceed along
> that sequence directly.
> >
> > So instead of:
> >
> > dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → ...
> >
> > the /intent/ is to express:
> > T1 → T2 → T3 → ...
> >
> >
> > where the exact order is computed dynamically.
> >
> >
> > 2. Why this is not just |goto|
> >
> > A |goto| in C++:
> >
> > *
> >
> > is *local to a function*
> >
> > *
> >
> > jumps to a label in the /same/ function
> >
> > *
> >
> > still lives entirely within the usual call/return and stack
> discipline
> >
> > *
> >
> > does not encode any higher-level scheduling or context
> >
> > What I am exploring is /not/ a new low-level transfer like |goto|, but a
> way to:
> >
> > *
> >
> > build a *sequence of tasks* (potentially different functions or
> micro-tasks),
> >
> > *
> >
> > optionally annotated with small context identifiers,
> >
> > *
> >
> > that a runtime can then walk through as the next evaluation steps.
> >
> > So it’s closer to:
> >
> > *
> >
> > “a declarative schedule of what to run next”
> >
> > than to:
> >
> > *
> >
> > “a raw jump from one instruction to another.”
> >
> > The semantics I have in mind *do not replace* C++’s existing call/return
> or stack behavior; they add a separate, higher-level description of
> evaluation order.
> >
> >
> > 3. Calling conventions and ABI
> >
> > You are absolutely right that anything that /directly/ bypasses
> call/return and ABI would be very problematic.
> >
> > To be clear: I am *not* proposing that C++ programmers manually
> circumvent calling conventions or stack protocols. Any eventual
> implementation in a conforming C++ implementation would still:
> >
> > *
> >
> > obey the platform ABI,
> >
> > *
> >
> > use normal calls/returns or coroutine resumption under the hood, or
> >
> > *
> >
> > use some runtime mechanism that remains conforming.
> >
> > The proposal is about *expressing the sequencing intent in the
> language/abstract machine*, not about specifying a particular low-level
> jump mechanism. How that sequencing is lowered (calls, trampolines, state
> machines, etc.) would be implementation-defined.
> >
> >
> > 4. Performance evidence (honest status)
> >
> > Right now, this is at the /research / exploratory/ stage:
> >
> > *
> >
> > The idea is motivated by internal experiments in event-driven
> simulation workloads, where dispatcher-style control-flow becomes a
> bottleneck.
> >
> > *
> >
> > I do *not* yet have a standardized, portable C++ compiler extension
> that we can benchmark across architectures.
> >
> > *
> >
> > I agree that for this to move beyond “interesting idea”, we will
> need at least one concrete prototype and some performance data.
> >
> > So in that sense, you are absolutely right to ask for measurements: they
> will be essential before any serious standardization attempt. At this point
> I am trying to:
> >
> > *
> >
> > check whether the conceptual direction makes sense to the committee,
> and
> >
> > *
> >
> > learn how to better express it in abstract-machine terms,
> >
> > *
> >
> > before investing in a more complete implementation and measurement
> campaign.
> >
> > Thanks again for the questions — they help me refine both the technical
> and explanatory sides of the proposal.
> >
> >
> >
> > *Best Regards,
> > Kamalesh Lakkampally,*
> > *Founder & CEO*
> > www.chipnadi.com <http://www.chipnadi.com>
> >
> >
> >
> > On Thu, Dec 4, 2025 at 12:44 PM Robin Savonen Söderholm via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:
> >
> > So if I understand correctly, you want to avoid "returning to the
> callee" and rather just jump to the next "task" (as something similar but
> not quite a function call) in hope that it would speed up the process? And
> you want to dynamically change what ever is next on the queue?
> > If so, how does this differ from e.g. "goto"? And can we get in
> trouble because this sounds as we will need to circumvent/handle manually
> things from the calling conventions? I wonder if we really can get so much
> performance from this proposal. Do you gave a (platform specific) example
> project that proves that your particular compiler extension indeed can give
> measurable performance improvements?
> >
> > // Robin
> >
> > On Thu, Dec 4, 2025, 07:43 Kamalesh Lakkampally via Std-Proposals <
> std-proposals_at_[hidden] <mailto:std-proposals_at_[hidden]>>
> wrote:
> >
> > Hi Marc,
> >
> > Thank you for your comments. Since the mailing list strips
> attachments, you have not seen the core details of the idea, so let me
> restate the proposal from the beginning.
> >
> >
> > 1. *What the proposal is NOT*
> >
> > It is *not*:
> >
> > *
> >
> > a new CPU instruction
> >
> > *
> >
> > a hardware prefetching mechanism
> >
> > *
> >
> > a cache hint
> >
> > *
> >
> > a pipeline control mechanism
> >
> > *
> >
> > an optimization directive
> >
> > *
> >
> > a request for compiler backend changes
> >
> > *
> >
> > a form of parallelism or GPU-style execution
> >
> > None of these describe the concept accurately.
> >
> > The proposal operates *purely at the C++ abstract-machine level*.
> >
> >
> > 2. *What the proposal /is/: A new way to express dynamic
> sequencing of micro-tasks*
> >
> > Many event-driven systems maintain a queue of very small tasks
> ("micro-tasks") whose execution order changes /frequently/ at runtime:
> >
> > At one moment: T1 → T2 → T3
> > Later: T3 → T4 → T1 → T2
> >
> > In C++ today, these systems must route control through a
> dispatcher:
> >
> > dispatcher();
> > → T1();
> > dispatcher();
> > → T2();
> > dispatcher();
> > → T3();
> >
> > Even though the /intended/program order is simply:
> >
> > T1 → T2 → T3
> >
> > This repeated dispatcher → task → dispatcher pattern:
> >
> > *
> >
> > is semantically unnecessary
> >
> > *
> >
> > consumes execution bandwidth
> >
> > *
> >
> > prevents compiler optimization
> >
> > *
> >
> > introduces unpredictable control flow
> >
> > *
> >
> > creates overhead for extremely fine-grained tasks
> >
> > The proposal asks:
> >
> > *Can C++ express dynamic sequencing declaratively, without
> requiring the program to re-enter the dispatcher between every micro-task?*
> >
> >
> > 3. *The key idea: “Fetch-only operations”*
> >
> > A /fetch-only operation/ is a C++ semantic construct that:
> >
> >
> > ✔ does NOT compute
> >
> >
> > ✔ does NOT read or write memory
> >
> >
> > ✔ does NOT have observable side effects
> >
> >
> > ✔ does NOT correspond to a function call or branch
> >
> >
> > ✔ EXISTS ONLY to describe “what executes next”
> >
> > In other words, it is a *pure sequencing directive*, not an
> instruction or computation.
> >
> > For example (placeholder syntax):
> >
> > |fad q[i] = next_address; fcd q[i] = thread_context; fed q[i] =
> exec_context; |
> >
> > These operations place *sequencing metadata*into a dedicated
> structure.
> >
> >
> > 4. *What metadata is being represented?*
> >
> > Each “micro-task” is associated with small context fields:
> >
> > *
> >
> > *8-bit thread-context*
> >
> > *
> >
> > *8-bit execution-context*
> >
> > *
> >
> > *an instruction address (or function entry)*
> >
> > These fields allow the program to encode:
> >
> > “After completing this task, the /next/ task to evaluate is
> at address X,
> > but only if its context matches Y.”
> >
> > This enables the expression of dynamic scheduling decisions
> *without returning through the dispatcher*.
> >
> >
> > 5. *Fetch-Only Region: where this metadata lives*
> >
> > Just as C++ programs conceptually have:
> >
> > *
> >
> > a *stack region* (automatic storage)
> >
> > *
> >
> > a *heap region* (dynamic storage)
> >
> > the proposal introduces a *fetch-only region*:
> >
> > *
> >
> > memory that stores sequencing metadata
> >
> > *
> >
> > strictly controlled by the implementation--> MMU(memory
> management unit)
> >
> > *
> >
> > with context validation
> >
> > *
> >
> > not accessible for ordinary loads/stores
> >
> > *
> >
> > used only for sequencing, not computation
> >
> > This region is *not hardware-specific*; it is an
> abstract-machine concept, much like the thread-local storage model or
> atomic synchronization regions.
> >
> >
> > 6. *Why this belongs at the language level*
> >
> > This is not expressible today because C++ has *no construct to
> describe sequencing separately from execution*.
> >
> > Existing mechanisms:
> >
> >
> > Threads / tasks / executors
> >
> > → Require execution-path transitions
> >
> >
> > Coroutines
> >
> > → Maintain suspended execution frames
> >
> >
> > Function calls
> >
> > → Require call/return semantics
> >
> >
> > Dispatch loops
> >
> > → Centralize sequencing and exhibit overhead
> >
> >
> > As-if rule
> >
> > → Cannot remove dispatcher calls; they are semantically required
> >
> > Dynamic sequencing is fundamentally:
> >
> > *
> >
> > *not parallelism*,
> >
> > *
> >
> > *not computation*,
> >
> > *
> >
> > *not scheduling*,
> >
> > *
> >
> > *not hardware control*.
> >
> > It is *language-level intent*:
> >
> > “Evaluate micro-tasks in this dynamic order, without routing
> control back through a dispatcher.”
> >
> > This cannot be expressed with intrinsics, LLVM passes, or
> target-specific code.
> >
> >
> > 7. *Why this is portable*
> >
> > Different compilers/runtimes/architectures may choose different
> implementations:
> >
> > *
> >
> > pure software interpreter for sequencing
> >
> > *
> >
> > optimizing JIT
> >
> > *
> >
> > coroutine resumption graph
> >
> > *
> >
> > architecture-specific fast paths (optional)
> >
> > *
> >
> > or normal dispatch loops as fallback
> >
> > The semantics remain:
> >
> > *Fetch-only operations provide sequencing metadata
> > fetch-only region stores it
> > execution proceeds in the declared order.*
> >
> > This is a valid addition to the C++ abstract machine, not a
> hardware feature.
> >
> >
> > *Why context fields matter*
> >
> > Context metadata enables:
> >
> > *
> >
> > *correct sequencing:* ensuring that only valid successor
> tasks are chosen
> >
> > *
> >
> > *safety checks:* preventing unintended jumps to unrelated
> micro-tasks
> >
> > *
> >
> > *structural integrity:* maintaining a well-defined
> evaluation graph
> >
> >
> > *Security aspect*
> >
> > Because the sequencing structure is stored in a dedicated
> *fetch-only region*—conceptually similar to the way the stack and heap
> represent distinct memory roles—the context fields also allow:
> >
> > *
> >
> > *validation of allowed transitions*,
> >
> > *
> >
> > *prevention of unauthorized or accidental modification*, and
> >
> > *
> >
> > *protection against control-flow corruption.*
> >
> > In other words, the combination of:
> >
> > *
> >
> > *context identifiers*, and
> >
> > *
> >
> > *a dedicated fetch-only region (analogous to stack/heap
> regions)*
> >
> > provides a framework in which implementations can enforce both
> correctness and security properties for dynamic sequencing.
> >
> > This occurs entirely at the semantic level; no explicit hardware
> behavior is assumed.
> >
> >
> > 8. *Why the proposal exists*
> >
> > Highly dynamic micro-task workloads (event-driven simulation is
> just one example) cannot currently express their intent in C++ without:
> >
> > *
> >
> > repeated dispatcher calls
> >
> > *
> >
> > unnecessary control-flow redirections
> >
> > *
> >
> > significant overhead for fine-grained scheduling
> >
> > This proposal explores whether C++ can support such workloads in
> a portable and declarative manner.
> >
> >
> > 9. *Still early-stage*
> >
> > This is an R0 exploratory draft. I am refining terminology,
> abstract-machine semantics, and examples with the help of feedback from
> this discussion.
> >
> >
> >
> > *Best Regards,
> > Kamalesh Lakkampally,*
> > *Founder & CEO*
> > www.chipnadi.com <http://www.chipnadi.com>
> >
> >
> >
> > On Thu, Dec 4, 2025 at 11:49 AM Marc Edouard Gauthier via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:
> >
> > Kamalesh,
> >
> > It’s not clear that you’re looking to change anything at all
> in the language. If you are, you haven’t said exactly what it is.
> >
> > It seems more that you have an unusual highly parallel
> hardware architecture, and that what you’re looking for is a very different
> compiler implementation, not a different or modified language.
> >
> > For example, in C++ or most any procedural language, you can
> write some sequence of independent steps:
> >
> > a = b + c;
> > d = e * f;
> > g = 3 + 5 / h;
> >
> > The compiler (compiler backend typically) is totally free to
> reorder these, or dispatch these, in whatever way the underlying hardware
> architecture allows.
> >
> > If there are dependencies, such as say, `k = a – 1;` the
> compiler ensures operations are ordered to satisfy the dependency, in
> whatever way the hardware architecture allows for.
> >
> > So it seems already possible to express “micro-tasks”,
> whatever these might be, as simple independent C++ statements.
> >
> > Assuming you have a novel computer hardware architecture,
> and you want compiler support for it, your time is very likely much better
> spent studying LLVM and how to port it to a new architecture, than trying
> to propose things to this group without understanding what you’re
> proposing.
> >
> > You may also find different languages or language extensions
> (many supported by LLVM) that help targeting highly parallel hardware such
> as GPUs, that perhaps better fit your hardware architecture.
> >
> > At least you might come out of that exercise with much
> better understanding of the relationship between your hardware and
> compilers.
> >
> > (My 2 cents.)
> >
> > Marc
> >
> > *From:*Std-Proposals <std-proposals-bounces_at_[hidden]
> <mailto:std-proposals-bounces_at_[hidden]>> *On Behalf Of *Kamalesh
> Lakkampally via Std-Proposals
> > *Sent:* Wednesday, December 3, 2025 21:46
> > *To:* std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>
> > *Cc:* Kamalesh Lakkampally <info_at_[hidden] <mailto:
> info_at_[hidden]>>
> > *Subject:* Re: [std-proposals] Core-Language Extension for
> Fetch-Only Instruction Semantics____
> >
> >
> > Hello Thiago,____
> >
> > Thank you for the thoughtful question — it touches on the
> central issue of whether this idea belongs in the C++ Standard or should
> remain an architecture-specific extension.
> > Let me clarify the motivation more precisely, because the
> original message did not convey the full context.____
> >
> >
> > *1. This proposal is */not/*about prefetching or
> cache-control*____
> >
> > The draft text unfortunately used terminology that sounded
> hardware-adjacent, but the intent is /not/ to introduce anything analogous
> to:____
> >
> > * prefetch intrinsics____
> > * non-temporal loads/stores____
> > * cache control hints____
> > * pipeline fetch controls____
> >
> > Those are ISA-level optimizations and naturally belong in
> compiler extensions or architecture-specific intrinsics.____
> >
> > The concept here is completely different and exists at the
> *semantic level*, not the CPU-microarchitecture level.____
> >
> >
> >
> > *2. The actual concept: explicit sequencing of
> dynamically changing micro-tasks*____
> >
> > Many event-driven systems (HDL simulators are just one
> example) share a common execution model:____
> >
> > * Thousands of /micro-tasks/____
> > * The order of tasks is computed dynamically____
> > * The order changes every cycle or even more frequently____
> > * Tasks themselves are small and often side-effect-free____
> > * A dispatcher function repeatedly selects the next task
> to run____
> >
> > In C++ today, this typically results in a control-flow
> pattern like:____
> >
> > dispatcher → T1 → dispatcher → T2 → dispatcher → T3 → …____
> >
> > even when the /intended evaluation sequence/ is
> conceptually:____
> >
> > T1 → T2 → T3 → T4 → …____
> >
> > The key issue is *expressiveness*: C++ currently has no
> mechanism to express “evaluate these things in this dynamic order, without
> re-entering a dispatcher between each step”.____
> >
> > Coroutines, tasks, executors, thread pools, and dispatch
> loops all still fundamentally operate through:____
> >
> > * repeated function calls, or____
> > * repeated returns to a central controller, or____
> > * runtime-managed schedulers____
> >
> > which means that as programs scale down to extremely
> fine-grained tasks, *sequencing overhead becomes dominant*, even on
> architectures where prefetching is not a concern.____
> >
> >
> >
> > *3. Why this is not architecture-specific*____
> >
> > The misunderstanding arises when “fetch-only operation”
> sounds like a CPU fetch-stage mechanism.____
> >
> > The actual idea is:____
> >
> > A mechanism in the abstract machine that allows a
> program to express a /sequence of evaluations/ that does not pass through a
> central dispatcher after each step.____
> >
> > This can be implemented portably in many ways:____
> >
> > * Using a software-run interpreter that consumes the
> sequencing structure____
> > * Using an implementation-specific optimization when
> available____
> > * Mapping to architecture-specific controls on processors
> that support such mechanisms____
> > * Or completely lowering into normal calls on simpler
> hardware____
> >
> > This is similar in spirit to:____
> >
> > * *coroutines*: abstract machine semantics, many possible
> implementations____
> > * *atomics*: abstract semantics, architecture maps them to
> whatever operations it has____
> > * *SIMD types*: portable semantics, mapped to different
> instructions per architecture____
> > * *executors*: abstract relationships, many possible
> backends____
> >
> > So the goal is not to standardize “fetch-stage hardware
> instructions”,
> > but to explore whether C++ can expose a *portable semantic
> form of sequencing that compilers may optimize very differently depending
> on platform capabilities*.____
> >
> >
> >
> > *4. Why prefetch intrinsics are insufficient*____
> >
> > Prefetch intrinsics provide:____
> >
> > * data locality hints____
> > * non-temporal load/store hints____
> > * per-architecture micro-optimizations____
> >
> > They do *not* provide:____
> >
> > * a semantic representation of evaluation order____
> > * a way to represent a dynamically computed schedule____
> > * a portable abstraction____
> > * a way to eliminate dispatcher re-entry in the program
> model____
> >
> > Prefetching does not remove the repeated calls/returns
> between tasks.
> > This proposal focuses on /expressing intent/, not on cache
> behavior.____
> >
> >
> > *5. Why this might belong in the Standard*____
> >
> > Because the idea is:____
> >
> > * semantic, not architectural____
> > * portable across CPU, GPU, and FPGA-style systems____
> > * potentially optimizable by compilers____
> > * relevant to domains beyond HDL tools____
> > * conceptually related to execution agents, coroutines,
> and sequencing in executors____
> > * about exposing user intent (dynamic ordering), not
> hardware control____
> >
> > Many modern workloads — including simulation, actor
> frameworks, reactive graph engines, and fine-grained schedulers — could
> benefit from a portable way to express:____
> >
> > “Here is the next unit of evaluation; no need to return
> to a dispatcher.”____
> >
> > even if the implementation varies drastically per target
> platform.____
> >
> >
> >
> > *6. Early-stage nature*____
> >
> > This is still an early R0 exploratory draft.
> > I fully expect that the idea will require:____
> >
> > * reframing in abstract-machine terminology____
> > * better examples____
> > * clarification of how sequencing is expressed____
> > * exploration of implementability on real compilers____
> >
> > I appreciate your question because it helps anchor the
> discussion in the right conceptual layer.____
> >
> > Thank you again for engaging — your perspective is extremely
> valuable.____
> >
> >
> > On Wed, Dec 3, 2025 at 10:22 PM Thiago Macieira via
> Std-Proposals <std-proposals_at_[hidden] <mailto:
> std-proposals_at_[hidden]>> wrote:____
> >
> > On Wednesday, 3 December 2025 03:38:57 Pacific Standard
> Time Kamalesh
> > Lakkampally via Std-Proposals wrote:
> > > The goal is to support workloads where the execution
> order of micro-tasks
> > > changes dynamically and unpredictably every cycle,
> such as *event-driven
> > > HDL/SystemVerilog simulation*.
> > > In such environments, conventional C++ mechanisms
> (threads, coroutines,
> > > futures, indirect calls, executors) incur significant
> pipeline redirection
> > > penalties. Fetch-only instructions aim to address this
> problem in a
> > > structured, language-visible way.
> > >
> > > I would greatly appreciate *feedback, criticism, and
> suggestions* from the
> > > community.
> > > I am also *open to collaboration.*
> >
> > Can you explain why this should be in the Standard? Why
> are the prefetch
> > intrinsics available as compiler extensions for a lot of
> architectures not
> > enough? My first reaction is that this type of design is
> going to be very
> > architecture-specific by definition, so using
> architecture-specific extensions
> > should not be an impediment.
> >
> > --
> > Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > Principal Engineer - Intel Data Center - Platform &
> Sys. Eng.
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> >
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> <https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals <
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden] <mailto:
> Std-Proposals_at_[hidden]>
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals <
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals>
> >
> >
>
>
Received on 2025-12-04 13:23:36
