> std::atomic<float> and std::atomic<double> are lock-free on major
> implementations, so there's no mutex.  Typically fetch_add and friends are
> implemented with a compare-exchange loop.

Oh, right, I hadn't thought of that. My bad, it was right there.

What this means, though, is that the cost of manipulating the floating-point
environment in the processor state is non-negligible.

On x86 without SSE2, saving and reloading the state would take expensive x87 instructions.  Note that the FP state setup doesn’t need to happen within the retry loop.
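
To make the "outside the retry loop" point concrete, here is a rough C++ sketch of what a library could do. It is only an illustration under my own assumptions, not what any implementation actually does: fetch_add_default_env is a made-up helper name, and it uses the portable <cfenv> calls rather than raw MXCSR or x87 instructions. Strictly speaking it also wants #pragma STDC FENV_ACCESS ON, which the common compilers only partly support.

```cpp
#include <atomic>
#include <cfenv>

// Hypothetical helper (not a standard library function): adds `operand` to
// `target` atomically, performing the addition in the default FP environment.
float fetch_add_default_env(std::atomic<float>& target, float operand) {
    std::fenv_t saved;
    std::fegetenv(&saved);       // save the caller's FP environment once
    std::fesetenv(FE_DFL_ENV);   // switch to the default environment once

    // Only the load, add and compare-exchange sit inside the retry loop.
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + operand)) {
        // On failure, `expected` now holds the current value; just retry.
    }

    std::fesetenv(&saved);       // restore the caller's environment once
    return expected;             // value observed before the addition
}
```

Called on an atomic holding 1.5f with an operand of 2.25f, this should return 1.5f and leave 3.75f stored, with the caller's rounding mode untouched afterwards.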

If the chip has SSE2, the compiler could use something like:

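stmxcsr [rsp]               ; save the caller's MXCSR
ldmxcsr [value_00001F80]    ; load the default MXCSR
mov eax, [rdi]              ; load the current value once, as the expected value
retry:
movd xmm1, eax              ; add to the same value we will compare against
addss xmm1, xmm0
movd ecx, xmm1              ; desired new bit pattern
lock cmpxchg [rdi], ecx     ; on failure, eax is reloaded with the current value
jnz short retry
ldmxcsr [rsp]               ; restore the caller's MXCSR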

LDMXCSR has a 7-cycle latency, but this whole sequence’s timing is dominated by the 100+-cycle lock cmpxchg.

I don’t know anything about the timing of changing the FP state, or of the two types of atomic operations (the LL/SC form and the newer x86-like CAS instruction).

Melissa