P1478 suggests that the added atomic_{load,store}_per_byte_memcpy can be implemented without accessing each byte individually:

Note that on standard hardware, it should be OK to actually perform the copy at larger than byte granularity. Copying multiple bytes as part of one operation is indistinguishable from running them so quickly that the intermediate state is not observed. In fact, we expect that existing assembly memcpy implementations will suffice when suffixed with the required fence. 
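
To make the quoted claim concrete, here is a rough sketch of what such an implementation might look like. It is not the P1478 reference implementation: the name sketch_atomic_load_memcpy, the 4-byte word size, the restriction to relaxed and acquire orders, and the use of the GCC/Clang __atomic builtins (chosen so that it compiles without C++20's std::atomic_ref) are all my own assumptions. It issues relaxed 4-byte loads where the source is aligned, relaxed byte loads elsewhere, and then the trailing fence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for std::atomic_load_per_byte_memcpy (P1478).
// Assumes GCC or Clang: __atomic_load_n and may_alias are compiler extensions.
typedef std::uint32_t __attribute__((may_alias)) word_t;

inline void* sketch_atomic_load_memcpy(void* dest, const void* src,
                                       std::size_t count,
                                       std::memory_order order) {
  // This sketch only handles the relaxed and acquire orders.
  assert(order == std::memory_order_relaxed ||
         order == std::memory_order_acquire);
  unsigned char* d = static_cast<unsigned char*>(dest);
  const unsigned char* s = static_cast<const unsigned char*>(src);

  // Head: relaxed byte loads until the source is 4-byte aligned.
  while (count > 0 &&
         reinterpret_cast<std::uintptr_t>(s) % sizeof(word_t) != 0) {
    *d++ = __atomic_load_n(s++, __ATOMIC_RELAXED);
    --count;
  }
  // Body: one relaxed 4-byte load per aligned source word.
  while (count >= sizeof(word_t)) {
    word_t w = __atomic_load_n(reinterpret_cast<const word_t*>(s),
                               __ATOMIC_RELAXED);
    std::memcpy(d, &w, sizeof w);  // dest itself is written non-atomically
    s += sizeof(word_t);
    d += sizeof(word_t);
    count -= sizeof(word_t);
  }
  // Tail: relaxed byte loads for whatever is left.
  while (count > 0) {
    *d++ = __atomic_load_n(s++, __ATOMIC_RELAXED);
    --count;
  }
  // The fence the relaxed loads are "suffixed with".
  if (order == std::memory_order_acquire) {
    std::atomic_thread_fence(std::memory_order_acquire);
  }
  return dest;
}
```

Note that which bytes are read with a 4-byte load and which with a 1-byte load depends on the source alignment and on count, neither of which the caller's other accesses to the same memory need to agree with.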

A Rust RFC proposes mirroring the same primitives in Rust, and it likewise suggests that the implementation is not restricted to byte-sized accesses. The question of mixed-size accesses was raised there (https://github.com/rust-lang/rfcs/pull/3301#pullrequestreview-1082913781). It appears to apply to P1478 as well, but the proposal does not discuss it.

Since the actual access granularity of atomic memcpy is implementation-defined and opaque to the user, the user cannot guarantee that the same granularity is used when the same address is accessed through something other than atomic_{load,store}_per_byte_memcpy, or through the same function with a different count argument. A typed API might be able to guarantee a consistent granularity for each type, but the proposed API takes void*, so this isn’t possible.
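
As an illustration of the count dependence, consider a hypothetical implementation strategy (my assumption, not something P1478 specifies) that uses 4-byte accesses for aligned full words and byte accesses for everything else. The helper below, with the made-up names AccessPlan and plan_accesses, only computes which access sizes such an implementation would issue for a given call; it touches no memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Access sizes a hypothetical atomic memcpy would issue for one call.
struct AccessPlan {
  std::size_t byte_accesses;  // number of 1-byte accesses
  std::size_t word_accesses;  // number of 4-byte accesses
};

// Assumed strategy: bytes up to 4-byte alignment, then whole words,
// then bytes for the remainder.
inline AccessPlan plan_accesses(std::uintptr_t addr, std::size_t count) {
  const std::size_t kWord = 4;
  const std::size_t total = count;
  AccessPlan plan = {0, 0};

  // Bytes needed to reach the next 4-byte boundary (0 if already aligned).
  std::size_t head = (kWord - addr % kWord) % kWord;
  if (head > count) head = count;
  plan.byte_accesses += head;
  count -= head;

  plan.word_accesses = count / kWord;   // aligned full words
  plan.byte_accesses += count % kWord;  // trailing bytes

  // Every requested byte is covered exactly once.
  assert(plan.byte_accesses + plan.word_accesses * kWord == total);
  return plan;
}
```

Under this assumed strategy, plan_accesses(0x1000, 4) yields one 4-byte access, while plan_accesses(0x1000, 3) yields three 1-byte accesses starting at the same address, so two callers that differ only in count already disagree on operand length.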

For instance, it's valid to have:

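#include <atomic>

struct Foo {
  std::atomic_int32_t a;
};

// Thread 1: a single 4-byte atomic load of foo.a.
void thread1(Foo& foo) {
  foo.a.load(std::memory_order_acquire);
}

// Thread 2: reads the same bytes through the API proposed in P1478.
// The access granularity here is chosen by the implementation.
void thread2(Foo& foo) {
  Foo foo_copy;
  std::atomic_load_per_byte_memcpy(&foo_copy, &foo, sizeof(Foo), std::memory_order_acquire);
}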

This results in the same address being accessed by two threads with potentially differently sized atomic operations, with no ordering between them.

The proposal seems to suggest that this is fine, but Intel disagrees:

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.

From the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, §9.1.2.2, Software Controlled Bus Locking

Partly due to this, Rust currently considers racing mixed-size atomic accesses to be UB, and this is an outstanding concern of the RFC. Because C++ has strongly typed memory, it has so far not been possible to perform mixed-size atomic accesses without UB, but P1478 appears to open this up. I wonder what people here think?

Andy