Date: Wed, 23 Aug 2023 23:20:47 -0700
On Wednesday, 23 August 2023 22:27:48 PDT Myria via Std-Discussion wrote:
> If the chip has SSE2, the compiler could use something like:
>
> stmxcsr [rsp]
> ldmxcsr [value_00001F80]
> retry:
> movss xmm1, [rdi]
> mov eax, [rdi]
> addss xmm1, xmm0
> movd ecx, xmm1
> lock cmpxchg [rdi], ecx
> jnz short retry
> ldmxcsr [rsp]
>
> LDMXCSR has a 7-cycle latency, but this whole sequence’s timing is
> dominated by the 100+-cycle lock cmpxchg.
CMPXCHG can be much faster than 100 cycles, depending on just how close that
cacheline is to the CPU executing the instruction. If it hasn't lost the
cacheline since the last write operation, it could be as low as 14 cycles.
On the other hand, LDMXCSR is also a memory instruction. Compilers would be
well advised to load the 0x1f80 from the stack, thus ensuring locality, but I
wouldn't be surprised to see some generate code that loads a global 0x1f80
instead, because the optimiser figured that storing it on the stack ahead of
time is more costly.
The cost of the instruction computed from uops alone may be misleading. This
instruction may cause a pipeline stall depending on what else is happening
with the FP hardware, because I don't know whether the MXCSR state is
propagated with the instructions as they're dispatched. For example, if you did:
x.fetch_add(1.0 / f);
The LDMXCSR may need to wait for the division to finish, preventing the
addition from starting early. I'm speculating here; I'll ask the HW architects
when I get the chance.
Don't forget that the moves between register files (the MOVD instructions) are
also non-trivial and cost 3 cycles each. So there's an argument that locking a
mutex and performing this addition a single time can be less expensive than
iterating multiple times: you're paying the same cost for the CMPXCHG, but
you're guaranteed a single addition and no register-file moves.
Here's what compilers generate: https://gcc.godbolt.org/z/Pc53W87Ps
Looks like the state of the art still has room for improvement. All four
compilers decided to load the value that STMXCSR saved to the stack into a
register, even though they had to re-save it to the stack again in order to
LDMXCSR at the end. A case where register allocation was too aggressive.
I was also scratching my head at why GCC decided to skip the register file
operation and instead go through memory. My guess is that it's trying to
optimise port usage: the addition and the register file move could contend for
port 0. The old ICC did the same, so I think it's on purpose, not a glitch.
And WTF is MSVC doing?
Anyway, what this shows is that those who want to work on environments with
unmasked exceptions can achieve it. So why should everyone else pay the price?
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel DCAI Cloud Engineering
Received on 2023-08-24 06:20:49