Date: Thu, 11 Sep 2025 13:06:33 -0700
On Thursday, 11 September 2025 12:00:48 Pacific Daylight Time Lénárd Szolnoki
via Std-Proposals wrote:
> On 11/09/2025 16:58, Tiago Freire wrote:
> > Look at this signature:
> >
> > void func(_BitInt(512)& var);
> >
> > true or false, if this function uses var in a way that requires being
> > loaded into zmm does it require realignment?
It'll depend on what the psABI document says. If it says that objects shall be
aligned naturally up to 512 bits, then the above will be aligned. If it says
aligned only up to 64 bits (which is what the current psABI documents say),
then it won't be aligned to 64 bytes, only 64 bits.
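
To make the two alignment outcomes concrete, here is a minimal sketch. It assumes a hypothetical stand-in struct (Wide512) rather than a real _BitInt(512), since _BitInt is not portable C++; the names Wide512, AlignedWide512 and is_zmm_aligned are illustrative only and do not come from any psABI or proposal text.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for _BitInt(512): eight 64-bit limbs. Its natural
// alignment is that of std::uint64_t, which mirrors a psABI that only
// aligns such objects up to 64 bits.
struct Wide512 {
    std::uint64_t limb[8];
};

// Same layout, but the declaration explicitly requests 64-byte (512-bit)
// alignment, as a caller could do with alignas on an individual object.
struct alignas(64) AlignedWide512 {
    std::uint64_t limb[8];
};

static_assert(sizeof(Wide512) == 64, "eight 64-bit limbs");
static_assert(alignof(AlignedWide512) == 64, "over-aligned by request");

// Returns true if p happens to sit on a 64-byte boundary, i.e. an aligned
// 512-bit load from p would be valid.
inline bool is_zmm_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

A function taking Wide512& cannot assume is_zmm_aligned() holds for its argument; one taking AlignedWide512& can.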
> False, it doesn't, it can just use unaligned load, which should be much
> faster than realigning the object in memory then using an aligned load in
> all cases.
>
> If you don't want it to be a performance penalty, you can call it with a
> reference to a suitably aligned object, which you can arrange with alignas.
> AFAIK unaligned load instructions are not any slower on recent CPUs when
> given a memory address that happens to be suitably aligned.
For a single object, that's true.
For an array of such objects, the effect could accumulate. But I wonder, what
operation could you do on an array of _BitInt that could be vectorised to such
a high degree? Bitwise operations don't need to operate on full objects.
Arithmetic does, but there are no vector instructions for this.
You could do a 512-bit add by doing one 64-bit add and detecting an overflow,
then adding to the next lane and so forth, at a depth of 8, so a minimum latency
of 24 cycles (assuming 1 cycle for each of 3 uops, which isn't the case today).
That means you'd get a misalignment penalty only after 48 entries in your
array, which is way too much work for any current processor to do in a single
cycle.
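
The carry dependency described above can be sketched in scalar code. This is only an illustration under assumptions: the Limbs alias and the add512 name are hypothetical, the 512-bit value is modelled as eight 64-bit limbs with the least significant first, and a compiler may well lower the loop to an add/adc chain rather than the explicit comparisons shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical representation: limb 0 is the least significant 64 bits.
using Limbs = std::array<std::uint64_t, 8>;

// Adds b into a limb by limb and returns the carry out of the top limb.
// Each iteration needs the carry from the previous one, which is the
// depth-8 dependency chain that limits how far this can be vectorised.
inline bool add512(Limbs& a, const Limbs& b) {
    bool carry = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t sum = a[i] + b[i];
        bool c1 = sum < a[i];           // overflow from adding the limbs
        std::uint64_t sum2 = sum + (carry ? 1u : 0u);
        bool c2 = sum2 < sum;           // overflow from adding the carry in
        a[i] = sum2;
        carry = c1 || c2;               // at most one of c1, c2 can be set
    }
    return carry;
}
```

Bitwise operations have no such chain between limbs, which is why they do not need to treat the object as a whole.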
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Principal Engineer - Intel Platform & System Engineering
Received on 2025-09-11 20:06:41