ISOCPP std-proposals List: Re: [std-proposals] D3666R0 Bit-precise integers

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Fri, 5 Sep 2025 22:49:08 +0000

OK, let's not forget that ARM and RISC are also supported CPUs.

________________________________
From: Std-Proposals <std-proposals-bounces_at_[hidden]> on behalf of Paul Caprioli via Std-Proposals <std-proposals_at_[hidden]>
Sent: Saturday, September 6, 2025 12:35:44 AM
To: std-proposals_at_[hidden] <std-proposals_at_lists.isocpp.org>
Cc: Paul Caprioli <paul_at_[hidden]>
Subject: Re: [std-proposals] D3666R0 Bit-precise integers

When I said it's one instruction, I meant that vmovdqu32 is one instruction and can handle loads that are split across cache lines.
The instruction vmovdqa32 is a different instruction, and it causes an exception if the memory address is not 64B-aligned.

Similarly for floating point: vmovups is one instruction, and vmovaps is one instruction.
The former handles loads that are split across cache lines.
The latter causes an exception if the address is not 64B aligned.

Yes, the compiler output I've seen (working in floating point) is always vmovups, even if the compiler knows the address is aligned.
That's fine; there's no penalty for using vmovups when the address is aligned (on hardware less than a decade or so old).
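
To make "split across cache lines" concrete, here is a minimal sketch that checks whether an access of a given width would touch more than one cache line. It is illustrative only and was not part of the original message: the helper name `splits_cache_line`, the constant `kCacheLine`, and the 64-byte line size are assumptions, and the check relies on pointers converting to integers linearly, which holds on mainstream platforms but is not guaranteed by the standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes (64B, as discussed in this thread).
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: returns true if reading `width` bytes starting at `p`
// would touch more than one cache line, i.e. the load is "split".
inline bool splits_cache_line(const void* p, std::size_t width = 64) {
    assert(width > 0);
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t first_line = addr / kCacheLine;
    const std::uintptr_t last_line = (addr + width - 1) / kCacheLine;
    return first_line != last_line;
}
```

With a 64-byte load, any address that is not itself 64B-aligned yields a split, which is why alignment matters most for full-width AVX-512 accesses.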

Yes, it is a good thing for performance to align arrays by sticking `alignas(64)` in front.
Or use std::aligned_alloc() with a 64B alignment.
vmovups is faster and more efficient when the address is 64B-aligned.
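
As a rough sketch of the heap option (assuming C++17 and a standard library that provides std::aligned_alloc, which MSVC's does not), the following shows one way to get a 64B-aligned float buffer. The names `is_aligned`, `FreeDeleter`, and `make_aligned_floats` are hypothetical helpers made up for this example. Note that std::aligned_alloc expects the size to be a multiple of the alignment, so the sketch rounds the byte count up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical helper: true if `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

// Hypothetical helper: allocates room for `count` floats on a 64B boundary.
// The byte count is rounded up to a multiple of 64, as aligned_alloc expects.
// Returns an empty pointer if the allocation fails.
inline std::unique_ptr<float[], FreeDeleter> make_aligned_floats(std::size_t count) {
    const std::size_t bytes = ((count * sizeof(float) + 63) / 64) * 64;
    return std::unique_ptr<float[], FreeDeleter>(
        static_cast<float*>(std::aligned_alloc(64, bytes)));
}
```

For automatic or static storage, `alignas(64) float data[16];` gives the same guarantee without any allocation, and `is_aligned(data, 64)` can be used to confirm it.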

-----Original message-----
From: Jason McKesson via Std-Proposals <std-proposals_at_[hidden]>
Sent: Friday, September 5 2025, 3:17 pm
To: std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Cc: Jason McKesson <jmckesson_at_[hidden]>
Subject: Re: [std-proposals] D3666R0 Bit-precise integers

On Fri, Sep 5, 2025 at 5:36 PM Paul Caprioli via Std-Proposals
<std-proposals_at_[hidden]> wrote:
>
> It's one AVX512 instruction. (Oh, I see in my previous email I referred to the floating-point instructions movaps and movups out of habit, but same idea.)
> Yes, you are correct that there is a performance penalty when a load (or store) is split across cache lines, and it can be significant.
> At the microarchitectural level, the cache delivers 64B-aligned chunks to the memory pipeline and the circuitry handles requesting two cachelines and putting the right bits into the register.
> So, in that sense, it's not a single load/store.
> But in the sense of assembly language, it's one instruction.

If it always compiles down to the same assembly/ML instructions...
can't you just stick `alignas(64)` in front of it if you want that
performance? It'd be the same code output either way, so this would
just be a thing you could do as an optimization if you need it, right?

Obviously if this is happening at a library interface, then adding
that alignment to that interface after the fact could break ABI. But
there's plenty of stuff that doesn't happen in library interfaces.
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals


Received on 2025-09-05 22:49:11