ISOCPP std-proposals List: Re: [std-proposals] Implementability of P1478: Byte-wise atomic memcpy on x86

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 06 Apr 2023 14:26:49 -0300

On Thursday, 6 April 2023 12:10:32 -03 Andy via Std-Proposals wrote:
> Resulting in the same address being accessed by two threads with potentially
> differently sized atomic operations, without order.
>
> The proposal seems to suggest that this is fine, but Intel disagrees

I disagree that the proposal suggests that is fine. In addition to the
difference that Jens has noted, the paper explicitly says

"Similarly, we would expect undefined behavior if the writer updates the source
using atomic operations of a different granularity"

Though that is shortly before and in support of the introduction of
atomic_store_per_byte_memcpy. Other than the above, there's no discussion
about atomic operations of different sizes / granularities. Therefore, I
wouldn't say the paper suggests that it's fine; I'd say the paper is silent on
the subject and therefore lets things stand as status quo.

> > Software should access semaphores (shared memory used for signalling
> > between multiple processors) using identical addresses and operand
> > lengths. For example, if one processor accesses a semaphore using a
> > word access, other processors should not access the semaphore using a
> > byte access.
> From Intel Architectures Software Developer’s Manual 3A §9.1.2.2 Software
> Controlled Bus Locking
>
> Partly due to this, Rust currently considers racing mixed size atomic
> accesses to be UB, and this is an outstanding concern of the RFC. Since C++
> has strongly typed memory, it was not possible to perform mixed size atomic
> accesses (without UB), but P1478 appears to open this up. I wonder what
> people here think?

I agree it should be disallowed, at least for cross-platform behaviour.

The paper does talk about how the *implementation* of those two functions
could operate at coarser granularities. Because of that, it is possible that
on some architectures such operations will not produce correct results when
mixed with a concurrent operation of a different size in another thread. But
even then, I am not sure this is a fatal problem, because any tearing that
this could produce is already allowed for: the load operation does not
guarantee atomicity at any level higher than a byte, and it's up to the
surrounding code to ensure that the data that was loaded didn't get torn
(that's what the sequence number is for).
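
To make the sequence-number point concrete, here is a minimal seqlock sketch.
It is not the paper's API: the per-byte relaxed loops are a hand-rolled
stand-in for the proposed atomic_load_per_byte_memcpy and
atomic_store_per_byte_memcpy, and the names SeqLock and Payload are
hypothetical. It assumes a single writer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical payload; any trivially copyable type would do.
struct Payload {
    std::uint32_t a;
    std::uint32_t b;
};
static_assert(std::is_trivially_copyable<Payload>::value,
              "payload is copied byte by byte");

// Sketch of a single-writer seqlock. The byte loops stand in for the
// proposed per-byte atomic memcpy functions (assumption, not the real API).
class SeqLock {
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<unsigned char> bytes_[sizeof(Payload)]{};

public:
    void store(const Payload& p) {
        unsigned char tmp[sizeof(Payload)];
        std::memcpy(tmp, &p, sizeof tmp);

        const std::uint32_t s = seq_.load(std::memory_order_relaxed);
        assert((s & 1u) == 0 && "single writer assumed");
        seq_.store(s + 1, std::memory_order_relaxed);   // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        for (std::size_t i = 0; i < sizeof tmp; ++i)
            bytes_[i].store(tmp[i], std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);   // even: write complete
    }

    Payload load() const {
        for (;;) {
            const std::uint32_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1u)
                continue;                               // writer is active, retry

            // Each byte is loaded atomically, but the group may be torn.
            unsigned char tmp[sizeof(Payload)];
            for (std::size_t i = 0; i < sizeof tmp; ++i)
                tmp[i] = bytes_[i].load(std::memory_order_relaxed);

            std::atomic_thread_fence(std::memory_order_acquire);
            const std::uint32_t s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2) {                             // no writer interfered
                Payload out;
                std::memcpy(&out, tmp, sizeof out);
                return out;
            }
        }
    }
};
```

The reader never relies on the copy being atomic as a whole; a torn copy is
simply discarded when the two sequence reads disagree.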

However, if you know your architecture, this *could* be fine. Even on Intel
processors (note: I work for Intel and this is an area I am very familiar
with) you *can* mix different-sized operations and still retain atomicity,
provided you obey some rules:

* never cross a cacheline boundary (preferably, align naturally)
* use operations of 16 bytes or less (ideally: use only 1-uop operations)
  * this holds for all current in-market processors, and I believe it
    applies to AMD too
  * for all current P-core processors, 32- and 64-byte accesses are also fine

Note: I don't know how valid this is for the new RAO instructions, but they
should be ok for CMPccXADD.
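
As a rough illustration of the first two rules, the address arithmetic can be
written down as below. This is only a sketch: the 64-byte cache line size is
an assumption, the helper names are hypothetical, and passing these checks
says nothing about what the C++ abstract machine permits.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// True if the bytes [addr, addr + size) all lie in one cache line.
constexpr bool within_one_cacheline(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           addr / kCacheLineSize == (addr + size - 1) / kCacheLineSize;
}

// True if size is a power of two and addr is a multiple of it.
constexpr bool naturally_aligned(std::uintptr_t addr, std::size_t size) {
    return size != 0 && (size & (size - 1)) == 0 && addr % size == 0;
}

// The two portable-ish rules from the list: at most 16 bytes, and no
// cache line crossing. Natural alignment is preferred but not required.
constexpr bool fits_mixed_size_rules(std::uintptr_t addr, std::size_t size) {
    return size <= 16 && within_one_cacheline(addr, size);
}

// An 8-byte access at 0x1000 stays inside one line.
static_assert(fits_mixed_size_rules(0x1000, 8), "within one cache line");
// An 8-byte access at 0x103c spans 0x103c..0x1043 and crosses 0x1040.
static_assert(!fits_mixed_size_rules(0x103c, 8), "crosses a cache line");
```

A naturally aligned access of a power-of-two size no larger than the cache
line can never cross a line boundary, which is why natural alignment is the
simplest way to satisfy the first rule.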

In fact, I've seen a few codebases that do use the fact that they can mix two
atomic_uint32_t with an overlapping atomic_uint64_t, at least when it comes to
interfacing with the Linux kernel 32-bit futex support, see
https://codebrowser.dev/glibc/glibc/nptl/sem_waitcommon.c.html#do_futex_wait

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DCAI Cloud Engineering

Received on 2023-04-06 17:26:57