Date: Fri, 5 Sep 2025 21:24:04 +0000
Section 15 of the Intel optimization reference manual explains this: https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
Even though you have a function for it, it doesn't do it in a single load/store. And you incur a performance penalty because it needs to fetch two cache lines and perform the operation on each one individually.
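
To make the cache-line point concrete, here is a minimal sketch that checks whether an access of a given size straddles a cache-line boundary. The 64-byte line size and the helper names (kCacheLine, crosses_cache_line, lines_touched) are assumptions for illustration, not something taken from the Intel manual or from any library.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, which is typical for current x86-64 parts.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: number of cache lines touched by an access of
// `size` bytes starting at address `addr`.
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    if (size == 0) return 0;
    const std::uintptr_t first = addr / kCacheLine;              // line holding the first byte
    const std::uintptr_t last  = (addr + size - 1) / kCacheLine; // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}

// Hypothetical helper: true if the access spans more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return lines_touched(addr, size) > 1;
}

// A 64-byte (512-bit) access at a 64-byte-aligned address stays in one line;
// the same access at any other offset has to touch two lines.
static_assert(!crosses_cache_line(0, 64), "aligned 64-byte access: one line");
static_assert(crosses_cache_line(32, 64), "misaligned 64-byte access: two lines");
```

With a 64-byte type, any address that is not a multiple of 64 gives a split access, which is presumably why a 64-byte alignment is attractive for performance even where the hardware does not require it.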
-----Original Message-----
From: Paul Caprioli <paul_at_hpkfft.com>
Sent: Friday, September 5, 2025 23:19
To: std-proposals_at_lists.isocpp.org
Cc: Tiago Freire <tmiguelf_at_hotmail.com>
Subject: RE: [std-proposals] D3666R0 Bit-precise integers
> alignof(__m512i) = 64bytes
Interesting. That seems to be for performance reasons, since it's not required by hardware.
Note that the alignment of this type is 16 using GCC.
See: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_load*si512&ig_expand=5883,223,4019,4019,4103&avx512techs=AVX512F&cats=Load
Note that both assembly instructions, movaps and movups, have the same latency on the listed hardware (when the address is 64B-aligned).
That's not necessarily true on older hardware (e.g., Sandy Bridge), which one might guess explains why there are two instructions.
Received on 2025-09-05 21:24:07