std-proposals

Re: [std-proposals] D3666R0 Bit-precise integers

From: Paul Caprioli <paul_at_[hidden]>
Date: Fri, 5 Sep 2025 22:35:37 +0000
When I said it's one instruction, I meant that vmovdqu32 is one instruction and can handle loads that are split across cache lines. The instruction vmovdqa32 is a different instruction, and it causes an exception if the memory address is not 64B-aligned.

Similarly for floating point: vmovups is one instruction, and vmovaps is one instruction. The former handles loads that are split across cache lines. The latter causes an exception if the address is not 64B-aligned.

Yes, the compiler output I've seen (working in floating point) is always vmovups, even if the compiler knows the address is aligned. That's fine; there's no penalty for using vmovups when the address is aligned (on hardware less than a decade or so old).

Yes, it is a good thing for performance to align arrays by sticking `alignas(64)` in front. Or, use std::aligned_alloc() with a 64B alignment. The vmovups is faster and more efficient when the address is 64B-aligned.

-----Original message-----
From: Jason McKesson via Std-Proposals <std-proposals_at_[hidden]>
Sent: Friday, September 5 2025, 3:17 pm
To: std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Cc: Jason McKesson <jmckesson_at_[hidden]>
Subject: Re: [std-proposals] D3666R0 Bit-precise integers

On Fri, Sep 5, 2025 at 5:36 PM Paul Caprioli via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> It's one AVX512 instruction. (Oh, I see in my previous email I referred to the floating-point instructions movaps and movups out of habit, but same idea.)
> Yes, you are correct that there is a performance penalty when a load (or store) is split across cache lines, and it can be significant.
> At the microarchitectural level, the cache delivers 64B-aligned chunks to the memory pipeline and the circuitry handles requesting two cachelines and putting the right bits into the register.
> So, in that sense, it's not a single load/store.
> But in the sense of assembly language, it's one instruction.

If it always compiles down to the same assembly/ML instructions... can't you just stick `alignas(64)` in front of it if you want that performance? It'd be the same code output either way, so this would just be a thing you could do as an optimization if you need it, right?

Obviously if this is happening at a library interface, then adding that alignment to that interface after the fact could break ABI. But there's plenty of stuff that doesn't happen in library interfaces.

--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
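
For illustration only, the two alignment options mentioned in the message above (`alignas(64)` on a type or array, and `std::aligned_alloc()` with a 64B alignment) might look roughly like the following sketch. The names `Block`, `is_aligned_64`, `alloc_floats_64`, and `demo_alignment` are hypothetical and do not come from the thread or from D3666R0; the sketch assumes a C++17 implementation that provides `std::aligned_alloc` (MSVC's library, for example, does not).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumption: 64 bytes, matching the cache-line size and the 512-bit
// vector width discussed in the thread.
constexpr std::size_t kAlign = 64;

// Hypothetical helper: true if p sits on a 64-byte boundary.
inline bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Option 1: alignas(64) in front of the type. Every Block object,
// including stack and static ones, starts on a 64-byte boundary.
struct alignas(64) Block {
    float data[16];  // 16 floats = 64 bytes, one full 512-bit vector
};
static_assert(alignof(Block) == 64, "Block should be 64-byte aligned");

// Option 2: std::aligned_alloc with a 64-byte alignment.
// The requested size must be a multiple of the alignment, so round up.
// The caller releases the memory with std::free.
inline float* alloc_floats_64(std::size_t count) {
    assert(count > 0);  // hypothetical precondition for this sketch
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + kAlign - 1) / kAlign * kAlign;
    return static_cast<float*>(std::aligned_alloc(kAlign, bytes));
}

// Exercises both options and reports whether both results are aligned.
inline bool demo_alignment() {
    Block b{};
    float* p = alloc_floats_64(100);
    const bool ok = (p != nullptr) && is_aligned_64(&b) && is_aligned_64(p);
    std::free(p);
    return ok;
}
```

Neither option changes which load instruction a compiler chooses; per the message above, the output may still be vmovups. The point is only that the loads no longer straddle a cache line.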

Received on 2025-09-05 22:35:38