
Re: [std-proposals] std::atomic_pointer_pair

From: Andrey Semashev <andrey.semashev_at_[hidden]>
Date: Tue, 30 Dec 2025 00:45:43 +0300
On 29 Dec 2025 23:56, Thiago Macieira via Std-Proposals wrote:
> On Monday, 29 December 2025 17:34:30 Brasilia Standard Time Andrey Semashev
> via Std-Proposals wrote:
>>> Also do note the next extension isn't documented: all Intel AVX512-capable
>>> processors support 256-bit atomic loads and stores.
>>
>> This may be the case for existing CPUs, but is this guaranteed for
>> future products? Was there some sort of official commitment to maintain
>> this guarantee? Especially given that Intel E-cores are going to support
>> AVX10, which has 512-bit vectors.
>
> Neither is it guaranteed architecturally for the AVX2-capable ones for 128-
> bit. The SDM section you've referred to says what current products do, but
> fails to promise future ones will continue doing so.

No, the section says "Processors that enumerate support for Intel® AVX
[...] guarantee that the 16-byte memory operations performed by the
following instructions will always be carried out atomically: [...]".
This includes past and future products that support AVX. As far as I'm
concerned, this is as "architectural" as any other kind of atomic memory
access.

> However, knowing the microarchitectural details, it's likely to stay for good.
> The issue here is that N-bit wide operations can be cracked into two ½N-bit
> wide uops with acceptable performance, but cracking four-ways into ¼N-bit uops
> (4x128-bit loads for example) aren't likely so. It would introduce an extra 3-
> cycle latency for some operations and keep ports busy for far too long.
> Keeping it running while the machine takes exceptions is also difficult. This
> was done for the Sandybridge/Ivy Bridge AVX1 generation, the Gracemont & later
> E-cores when those got AVX1 & 2, and I understand that's how AMD did both AVX1
> & 2 at first, as well as AVX512 in their newest before transitioning to full
> 512-bit wide data paths. The E-cores on the upcoming AVX10-equipped Nova Lake
> are probably going to be the same, but I haven't studied the E-core
> microarchitecture in detail.

First, having an N-bit data path to the cache doesn't guarantee that
N-bit memory accesses are going to be atomic. Neither does it for
N/2-bit or N/4-bit accesses. There may be any number of
microarchitectural reasons for the access to be non-atomic. For
example, the CPU might not power up the wider data path to conserve
energy. What the note about AVX in the SDM says, in particular, is that
there are no such microarchitectural reasons in CPUs featuring AVX
(otherwise, such CPUs would not conform to IA-32/Intel 64).

Second, I do not think a quad-pumped design would necessarily have poor
performance. The three-cycle latency is irrelevant to (or, at least, not
representative of) the performance characteristics of such a design.
(Note that the 3-cycle latency exists for cross-lane instructions even
in CPUs that have native 256- and 512-bit pipelines.) With Zen 4, AMD
has shown that a 256-bit design can implement 512-bit vector operations
with practically the same CPI as 256-bit operations (with the exception
of loads and stores, and probably a few cross-lane shuffles). I don't
see why the same could not be achieved with a 128-bit design.

But this is all getting off-topic on this list.

> I don't see much need for 256-bit atomic loads and stores, seeing as we lack a
> 256-bit CAS.

I agree with this part.

Received on 2025-12-29 21:45:46