Date: Fri, 5 Sep 2025 21:36:10 +0000
It's one AVX512 instruction. (Oh, I see in my previous email I referred to the floating-point instructions movaps and movups out of habit, but same idea.)
Yes, you are correct that there is a performance penalty when a load (or store) is split across cache lines, and it can be significant.
At the microarchitectural level, the cache delivers 64B-aligned chunks to the memory pipeline, and when an access straddles a boundary the circuitry requests two cache lines and assembles the right bytes into the register.
So, in that sense, it's not a single load/store.
But in the sense of assembly language, it's one instruction.
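To make the split condition concrete, here is a minimal sketch, assuming a 64-byte cache line. The names kCacheLine and splits_cache_line are hypothetical, chosen only for illustration; they are not part of any library or of the proposal. It only computes whether an access of a given size touches two cache lines; it says nothing about how large the penalty is on any particular hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as discussed in this thread.
constexpr std::uintptr_t kCacheLine = 64;

// Hypothetical helper: returns true if an access of `size` bytes starting at
// address `addr` touches more than one cache line, i.e. the first and last
// byte of the access fall into different 64B-aligned chunks.
constexpr bool splits_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// A 64-byte (512-bit) access stays within one line only if it is 64B-aligned.
static_assert(!splits_cache_line(0, 64), "aligned 64B access: one line");
static_assert(splits_cache_line(1, 64), "misaligned 64B access: two lines");
```

So for a 512-bit load, any address that is not a multiple of 64 means two cache lines are involved, even though the program still issues a single instruction.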
-----Original message-----
From: Tiago Freire <tmiguelf_at_[hidden]>
Sent: Friday, September 5 2025, 2:24 pm
To: Paul Caprioli <paul_at_[hidden]>; std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Subject: RE: [std-proposals] D3666R0 Bit-precise integers
The Intel optimization guide, section 15, explains this: https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
Even though you have a function for it, it doesn't do it in a single load/store. And you incur a performance penalty because it needs to fetch 2 cache lines and perform the operation on each individually.
-----Original Message-----
From: Paul Caprioli <paul_at_[hidden]>
Sent: Friday, September 5, 2025 23:19
To: std-proposals_at_[hidden]
Cc: Tiago Freire <tmiguelf_at_[hidden]>
Subject: RE: [std-proposals] D3666R0 Bit-precise integers
> alignof(__m512i) = 64bytes
Interesting. That seems to be for performance reasons, since it's not required by hardware.
Note that the alignment of this type is 16 using GCC.
See: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_load*si512&ig_expand=5883,223,4019,4019,4103&avx512techs=AVX512F&cats=Load
Note that both assembly instructions, movaps and movups, have the same latency on the listed hardware (when the address is 64B-aligned).
That's not necessarily true on older hardware (e.g., Sandy Bridge), which one might guess explains why there are two instructions.
Received on 2025-09-05 21:36:15