Date: Mon, 29 Dec 2025 21:42:27 -0300
On Monday, 29 December 2025 18:45:43 Brasilia Standard Time Andrey Semashev
via Std-Proposals wrote:
> No, the section says "Processors that enumerate support for Intel® AVX
> [...] guarantee that the 16-byte memory operations performed by the
> following instructions will always be carried out atomically: [...]".
> This includes past and future products that support AVX. As far as I'm
> concerned, this is as "architectural" as any other kind of atomic memory
> access.
I stand corrected.
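
For the record, here is roughly what relying on that guarantee looks like in
practice. This is a minimal sketch, not a drop-in std::atomic replacement: it
assumes the CPU enumerates AVX, the pointer is 16-byte aligned, the memory is
ordinary write-back cacheable memory, and the build has AVX enabled so these
intrinsics actually lower to (V)MOVDQA, one of the instruction forms the SDM
lists. No memory ordering beyond the plain load/store is implied.

#include <immintrin.h>
#include <cstdint>

struct alignas(16) Pair128 {
    std::uint64_t lo, hi;
};

// 16-byte load; expected to compile to VMOVDQA xmm, m128 when built with -mavx
inline Pair128 load16(const Pair128 *p)
{
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(p));
    Pair128 r;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&r), v);
    return r;
}

// 16-byte store; expected to compile to VMOVDQA m128, xmm when built with -mavx
inline void store16(Pair128 *p, Pair128 value)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&value));
    _mm_store_si128(reinterpret_cast<__m128i *>(p), v);
}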
> Second, I do not think a quad-pumped design would necessarily have poor
> performance. The three-cycle latency is irrelevant to (or, at least, not
> representative of) performance characteristics of such a design. (Note
> that the 3-cycle latency exists for cross-lane instructions in the CPUs
> that have native 256 and 512-bit pipelines.) In Zen 4, AMD has shown
> that a 256-bit design can implement 512-bit vector operations
> practically with the same CPI as 256-bit operations (with the exception
> of loads and stores, and probably a few cross-lane shuffles). I don't
> see why the same could not be achieved with a 128-bit design.
I meant that cracking one op into 4 uops means there are 3 more uops to
dispatch, and they will likely go in sequence to the same port even if other
ports are free. Cracking once (into two uops) is also significantly less
complex to implement in the CPU than cracking four ways. Over the long run the
instructions-per-cycle impact is amortised, because it amounts to one extra
cycle of latency overall, spread across however many instructions are in
flight, unless you make branch decisions based on the late data (which you
shouldn't be doing in SIMD code anyway).
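As a rough back-of-the-envelope illustration (the numbers here are made up for
the example): if a loop retires 1000 independent vector operations at one per
cycle, adding one cycle of result latency changes the total from roughly 1000
cycles to roughly 1001, because results still stream out at the same rate once
the deeper pipeline is full. The extra cycle only becomes visible when
something has to stall waiting for an individual late result, such as a
data-dependent branch.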
> But this is all getting off-topic on this list.
Right.
What I can say is that I have tested every AVX512-capable P-core CPU, and they
can all do atomic loads and stores of up to 512 bits, so long as the data lies
within a single cache line. The first AVX512-capable E-core will be in NVL, but
I haven't got my hands on one yet.
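
For concreteness, the kind of access I'm talking about is simply an aligned
64-byte load or store, which by construction stays within one cache line. A
minimal sketch follows (assumes AVX-512F and a build with -mavx512f; and
remember that, unlike the 16-byte case above, this atomicity is an observation
about current hardware, not an architectural guarantee, so don't ship code
that depends on it):

#include <immintrin.h>

struct alignas(64) Line512 {
    unsigned char bytes[64];    // exactly one cache line on current x86
};

// Aligned 64-byte load: VMOVDQA64 zmm, m512; never splits a cache line
inline __m512i load64(const Line512 *p)
{
    return _mm512_load_si512(p);
}

// Aligned 64-byte store: VMOVDQA64 m512, zmm
inline void store64(Line512 *p, __m512i v)
{
    _mm512_store_si512(p, v);
}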
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Data Center - Platform & Sys. Eng.
Received on 2025-12-30 00:42:37
