Re: [std-proposals] D3666R0 Bit-precise integers

From: Paul Caprioli <paul_at_[hidden]>
Date: Fri, 5 Sep 2025 21:36:10 +0000
It's one AVX512 instruction. (Oh, I see in my previous email I referred to the floating-point instructions movaps and movups out of habit, but same idea.)

Yes, you are correct that there is a performance penalty when a load (or store) is split across cache lines, and it can be significant. At the microarchitectural level, the cache delivers 64B-aligned chunks to the memory pipeline, and the circuitry handles requesting two cache lines and putting the right bits into the register. So, in that sense, it's not a single load/store. But in the sense of assembly language, it's one instruction.

-----Original message-----
From: Tiago Freire <tmiguelf_at_[hidden]>
Sent: Friday, September 5 2025, 2:24 pm
To: Paul Caprioli <paul_at_[hidden]>; std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Subject: RE: [std-proposals] D3666R0 Bit-precise integers

Section 15 of the Intel optimization guide explains this:
https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html

Even though you have a function for it, it doesn't do it in a single load/store.
And you incur a performance penalty because it needs to get 2 cache lines and do the operation on each individually.

-----Original Message-----
From: Paul Caprioli <paul_at_[hidden]>
Sent: Friday, September 5, 2025 23:19
To: std-proposals_at_[hidden]
Cc: Tiago Freire <tmiguelf_at_[hidden]>
Subject: RE: [std-proposals] D3666R0 Bit-precise integers

> alignof(__m512i) = 64bytes

Interesting. That seems to be for performance reasons, since it's not required by hardware.
Note that the alignment of this type is 16 using GCC.

See: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_load*si512&ig_expand=5883,223,4019,4019,4103&avx512techs=AVX512F&cats=Load

Note that both assembly instructions, movaps and movups, have the same latency on the listed hardware (when the address is 64B-aligned).
That's not necessarily true on older hardware (e.g., Sandy Bridge), which one might guess explains why there are two instructions.
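
As a rough illustration of the cache-line split discussed in this thread, the sketch below (not taken from any of the emails) counts how many cache lines a memory access touches. It assumes 64-byte cache lines, and the names `kCacheLine`, `lines_touched`, and `splits_cache_line` are hypothetical helpers introduced only for this example. It uses no AVX-512 intrinsics, so it says nothing about actual latency; it only shows when a 64-byte access must straddle two lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on the x86 hardware discussed in the thread.
constexpr std::uintptr_t kCacheLine = 64;

// Number of cache lines covered by an access of `size` bytes starting at `addr`.
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    assert(size > 0 && "access must cover at least one byte");
    const std::uintptr_t first = addr / kCacheLine;              // line holding the first byte
    const std::uintptr_t last = (addr + size - 1) / kCacheLine;  // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}

// True when the access straddles a cache-line boundary.
constexpr bool splits_cache_line(std::uintptr_t addr, std::size_t size) {
    return lines_touched(addr, size) > 1;
}

// A 64-byte (512-bit) access at a 64B-aligned address stays within one line.
static_assert(lines_touched(0, 64) == 1, "aligned 64B access touches one line");
// The same access at a 16B-aligned but not 64B-aligned address needs two lines.
static_assert(lines_touched(16, 64) == 2, "misaligned 64B access touches two lines");
```

With a real buffer, one would pass `reinterpret_cast<std::uintptr_t>(ptr)` as `addr`. Under the 64-byte assumption, any 64-byte access whose address is not a multiple of 64 touches exactly two lines, which is the case where the split penalty described above would apply.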

Received on 2025-09-05 21:36:15