Date: Mon, 29 Dec 2025 16:21:34 -0300
On Monday, 29 December 2025 11:17:33 Brasilia Standard Time Andrey Semashev
via Std-Proposals wrote:
> > This is possible but undocumented: all machines capable of AVX have
> > 128-bit
> > atomic data paths. Using the VMOVDQA instruction will also have a much
> > higher performance penalty due to the register file transfers.
>
> It is documented in Intel SDM Volume 3, 10.1.1 Guaranteed Atomic Operations.
Thanks. I didn't know it was documented. I'm sure this was done in response to
compiler people asking for it, because I've seen the discussion in GCC's
bugzilla.
Also do note that the next extension isn't documented either: all Intel
AVX512-capable processors support 256-bit atomic loads and stores.
> But this is only needed if you place an atomic in read-only memory,
> which is a pretty pointless use case.
Indeed, you can use CMPXCHG16B to perform a load from memory you know to be
read-write (set the new and old values to the same thing; the worst that
happens is that you exchange a value with itself), but this has performance
problems, and the compiler cannot emit that code. An atomic RW operation
pulls the cacheline into Exclusive state, instead of leaving it in Shared
(q.v. MESI) or Read-Shared on newer CPUs. This means you should not spin on
such CAS-based loads, because that causes performance penalties for the
writers.
If you give the compiler a const std::atomic<__int128> and perform a load(),
it can't know whether the page is read-only or not. It must use a read-only
instruction and the only one available is VMOVDQA/VMOVAPS/VMOVAPD. If the
compiler can see an unconditional write to the same memory, it can reason
backwards that the memory is read-write (time-travelling UB) and thus could
emit the RW operation.
In other words, the physical register file (PRF) transfer cost (3 cycles)
might be worth paying to avoid the penalty of waiting for the Request For
Ownership (RFO) of the cacheline.
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Data Center - Platform & Sys. Eng.
Received on 2025-12-29 19:21:44
