Date: Wed, 3 Dec 2025 10:36:24 -0500
On Wed, Dec 3, 2025 at 8:50 AM Ville Voutilainen via Std-Proposals <
std-proposals_at_[hidden]> wrote:
>
> Yet C++ has operations, including atomic operations and SIMD
> operations that map directly to hw instructions. So, presumably,
> C++ could have fetch-only operations that map directly to fetch-only
> instructions and regions thereof.
> [...]
> > How would the abstract machine deal with CPU implementation level detail
> specifics? It's the *abstract* machine - not tied to details of how you
> might implement something.
>
> In a fashion similar-ish to how the abstract machine specifies how
> atomic operations work.
>
> >> Fetch-Only Region properties:
> >> - Holds fetch-only metadata (instructions addresses + 8-bit
> thread/execution context).
> >> - MMU-enforced write validation.
> >> - Stricter rules than normal memory.
> >> - Prevents unauthorized metadata forging.
> >
> > How would that region exist? What is it for? This reads like wishful
> thinking or, honestly, keyword vomit.
>
> It would exist for grouping operations that are all (guaranteed to be)
> fetch-only operations. What it's for is for efficient
> immune-to-reordering grouping of such operations.
>
Hm. You're interpreting "region" here as "lexical region" (as opposed to
"memory region")? I think "lexical region" actually makes a ton of sense,
given Kamalesh's later examples involving the bizarre `fad / fcd / fed`
keywords(?); but Kamalesh's original message very clearly bullet-points it
as "A new region in the virtual address space."
Anyway, here's my $.02:
(1) The original message and followups appear to be *at least partly*
nonsense, possibly due to OP's LLM getting confused about the overloaded
meanings of words (like "region"). But that's not to say there's nothing at
all here. And, to be clear, *Kamalesh, please stop using an LLM to generate
your posts.* It is fine to write *your* thoughts in your native language
and then use Google Translate <https://translate.google.com/> to translate
*your* message, but *stop* letting an LLM generate the content.
(2) AIUI, what we're trying to talk about here is instructions like
x86-64's `PREFETCHT1` <https://www.felixcloutier.com/x86/prefetchh>, which
says "please prefetch this cache line into my L2 cache," and `PREFETCHT0`
<https://www.felixcloutier.com/x86/prefetchh>, which says "please prefetch
this cache line into my L1 cache." These operations conceptually return
void — that is, they instruct the processor to fetch the data "soon,"
without actually waiting for the data to become available. On the flip
side, we also want to think about x86-64's `MOVNTI`
<https://www.felixcloutier.com/x86/movnti> (the "non-temporal" move), which
says "please put this value into memory at this address, but don't bother
to keep it hot; I won't read this cache line again"; and `MOVNTDQA`
<https://www.felixcloutier.com/x86/movntdqa> "please load a value from this
address, but don't bother to keep it hot; I won't read this cache line
again." And then somehow (this is where my understanding really ceases)
x86-64 also has `PREFETCHNTA`
<https://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together>,
which says, like, "I'm going to load from this address soon, but only once:
I actually *don't* want it polluting L1 right now, but I'm just letting you
know that I'll be MOVNTDQA'ing it next."
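(For reference, and to be clear this is existing compiler-extension practice rather than anything standard: GCC and Clang already expose a thin veneer over these instructions via `__builtin_prefetch(addr, rw, locality)`, where locality 0 roughly corresponds to PREFETCHNTA and 3 to PREFETCHT0. A minimal sketch:

```cpp
#include <cstddef>

// Sum an array while prefetching a few elements ahead.
// __builtin_prefetch is a GCC/Clang extension: second arg 0 = read,
// third arg 0..3 = "temporal locality" (0 ~ PREFETCHNTA, 3 ~ PREFETCHT0).
// The hint is purely advisory; the compiler may drop it entirely.
double sum_with_prefetch(const double *src, int n) {
    double sum = 0;
    for (int i = 0; i < n; ++i) {
        if (i + 16 < n) {
            __builtin_prefetch(&src[i + 16], /*rw=*/0, /*locality=*/3);
        }
        sum += src[i];
    }
    return sum;
}
```

The distance of 16 elements is a made-up tuning parameter; the right lookahead is exactly the kind of per-chip detail discussed below.)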
I can imagine an API for dealing with this kind of thing via the same
idioms we have with `atomic_ref`:
void sneaky_nonpolluting_copy(int *src, int *dst) {
    auto ccsrc = stdx::cache_control_ref<int>(*src);
    ccsrc.prefetch_nontemporal();
    auto ccdst = stdx::cache_control_ref<int>(*dst);
    int value = ccsrc.load_nontemporal();
    ccdst.store_nontemporal(value);
}
double add_every_seventh_row_of_a_1024x1024_matrix(double *src) {
    for (int row = 0; row < 1024; row += 7) {
        stdx::cache_control_ref<double>(src[1024 * row]).prefetch_l2();
        // make every seventh row hot in L2
    }
    double sum = 0;
    for (int row = 0; row < 1024; row += 7) {
        sum = std::accumulate(&src[row * 1024], &src[row * 1024] + 1024, sum);
    }
    return sum;
}
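To be concrete about what such a type might look like under the hood (purely my sketch; `stdx::cache_control_ref` does not exist anywhere), the prefetch verbs can be expressed today on GCC/Clang in terms of `__builtin_prefetch`, with the non-temporal load/store degrading to plain accesses on targets without MOVNTI/MOVNTDQA-style intrinsics:

```cpp
namespace stdx {

// Hypothetical sketch of the type used in the examples above, mapping
// each verb onto the GCC/Clang __builtin_prefetch extension. The third
// argument is "temporal locality": 3 ~ PREFETCHT0 (pull toward L1),
// 2 ~ PREFETCHT1 (pull toward L2), 0 ~ PREFETCHNTA.
template <class T>
class cache_control_ref {
    T *p_;
public:
    explicit cache_control_ref(T &t) : p_(&t) {}

    void prefetch_l1() const { __builtin_prefetch(p_, 0, 3); }
    void prefetch_l2() const { __builtin_prefetch(p_, 0, 2); }
    void prefetch_nontemporal() const { __builtin_prefetch(p_, 0, 0); }

    // A real x86-64 implementation would use _mm_stream_si32 /
    // _mm_stream_load_si128 (MOVNTI / MOVNTDQA) here; this portable
    // fallback just does ordinary loads and stores.
    T load_nontemporal() const { return *p_; }
    void store_nontemporal(T value) const { *p_ = value; }
};

} // namespace stdx
```

With that sketch, `sneaky_nonpolluting_copy` above compiles as written; and since every operation is a pure hint, an implementation that does nothing at all is also conforming.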
(3) However, this doesn't seem to me like a good fit for the Standard
Library, because it is intensely platform-dependent, and best practices for
using these instructions (AIUI) change with every new chip. See for
example a discussion of what even is PREFETCHNTA
<https://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together>
on StackOverflow. And notice in my fantasy code above, I ended up using
verbs like `prefetch_l2()` because that's what programmers will actually
want to use in practice, if they are able to use this at all. It does no
good to provide "generic" verbs like `.prefetch()` because nobody who uses
this stuff would know what that verb means, and vice versa, nobody who
thinks they "know" what that verb means has any reason to use this stuff.
(4) My wild interpretation of OP's "fad/fcd/fed" syntax is that he was
trying to make "fad" mean "open a block," "fcd" mean "continue the block,"
"fed" mean "last line of the block." The C++ code inside the block would be
ignored, except that the compiler would try to figure out which of its
accesses could be prefetched and would codegen PREFETCH instructions for
them. So his first line
fad q[i] = address_of(task);
would be equivalent to
stdx::cache_control_ref(q).prefetch_l1();
stdx::cache_control_ref(i).prefetch_l1();
(assuming "address_of(task)" was a typo for "std::addressof(task)", in
which case we don't need to fetch anything because we already have the
address of `task` in-hand).
That's my wild guess. Obviously nothing like *that* syntax is ever going to
happen in C++. My syntax in (2) might conceivably happen — I wouldn't put
it past LEWG :P — but as I say in (3), it's not a *good* idea for the
Standard.
(5) If anyone actually thinks a library like `stdx::cache_control_ref` from
(3) would be useful to them, I'd be happy to collaborate/advise on the C++
side of the implementation. I won't fully provide an implementation, both
because I don't know the (non-C++, hardware-level) domain well enough and
because *I* don't think it'd be useful to anyone. But if I'm wrong... happy
to help.
–Arthur
Received on 2025-12-03 15:36:41
