Date: Wed, 23 Aug 2023 22:27:48 -0700
>
>
> > std::atomic<float> and std::atomic<double> are lock-free on major
> > implementations, so there's no mutex. Typically fetch_add and friends
> > are implemented with a compare-exchange loop.
>
> Oh, right, I hadn't thought of that. My bad, it was right there.
>
> What this means, though, is that the cost of manipulating the floating
> point environment in the processor state is non-negligible.
On x86 without SSE2, saving and reloading the state would take expensive
x87 instructions. Note that the FP state setup doesn’t need to happen
within the retry loop.
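If the chip has SSE2, the compiler could use something like:

    stmxcsr [rsp]                 ; save the caller's MXCSR
    ldmxcsr [value_00001F80]      ; load the default MXCSR (0x1F80)
retry:
    mov eax, [rdi]                ; load the current value once
    movd xmm1, eax                ; same bits go into the FP register
    addss xmm1, xmm0              ; add the operand
    movd ecx, xmm1                ; new value as raw bits
    lock cmpxchg [rdi], ecx       ; store if [rdi] still equals eax
    jnz short retry
    ldmxcsr [rsp]                 ; restore the caller's MXCSR
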
LDMXCSR has a 7-cycle latency, but this whole sequence’s timing is
dominated by the 100+-cycle lock cmpxchg.
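
For comparison, here is a rough C++ sketch of the same shape: the FP environment is saved and restored once, outside the compare-exchange loop. It is only an illustration under assumptions, not what any particular standard library does. The name fetch_add_sketch is hypothetical, and FE_DFL_ENV stands in for the default MXCSR value used in the assembly above. Strictly, code that touches the FP environment also wants #pragma STDC FENV_ACCESS ON, which not every compiler honors, so the sketch leaves it out.

```cpp
#include <atomic>
#include <cfenv>

// Hypothetical helper (not a standard library function): atomically adds
// `delta` to `target` and returns the previous value.
float fetch_add_sketch(std::atomic<float>& target, float delta) {
    std::fenv_t saved;
    std::fegetenv(&saved);       // save the caller's FP environment (once)
    std::fesetenv(FE_DFL_ENV);   // assumption: do the add in the default environment

    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the current
    // value, so each retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }

    std::fesetenv(&saved);       // restore the caller's FP environment (once)
    return expected;
}
```

Keeping the save and restore outside the loop means a contended retry pays only for the add and the compare-exchange.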
On other architectures, I don’t know anything about the timing for
changing the FP state or for the two types of atomic operations (the
LL-SC form and the newer x86-like CAS instruction).
Melissa
Received on 2023-08-24 05:28:02