Date: Sun, 24 Jul 2022 17:59:46 -0700 (PDT)
On Tue, 19 Jul 2022, Ryan Nicholl via Std-Discussion wrote:
> Interjection here, but isn't the point of atomic wait/notify to
> implement things like condition variables? In that case, a read-only
> operation in relaxed ordering would not be required to obtain any value
> written before notify_one unless notify_one has an implied release. I'm
> unsure that the implied "happens before" ordering for notify exists
> especially given spurious wakeups could cause it to return before write
> and a write could happen in relaxed mode which implies no ordering. I
> doubt this is an issue in practice since the kernel will probably flush
> caches on context switch triggering a seq_cst-like behavior, but it's
> good to think about. I would say notify does NOT order anything and you
> "must" use acquire/release on the atomic to get ordering. Anything else
> could require a lot of overhead in the implementation if the kernel
> doesn't flush caches. My assumption is that wait/notify are intended as
> low level wrappers that map directly to SYS_futex, waitv,
> WaitForMultipleObjectsEx and similar system calls, so any kind of
> sequencing would make the implementation tricky if missing in any OS.
Under this interpretation, I believe it would actually be impossible to
ever use wait/notify without risking deadlock, regardless of the memory
order you select. I used relaxed ordering in my example for simplicity,
but the problem persists with stronger orders.
If we do:
#include <atomic>
#include <thread>

std::atomic<bool> b{false};

void thr1() {
    b.store(true, std::memory_order_release);
    b.notify_one();
}

int main() {
    std::thread t(thr1);
    b.wait(false, std::memory_order_acquire);
    t.join();
    return 0;
}
the only way to get synchronization between the threads is by having the
load in wait() take its value from the store() in thr1
(https://eel.is/c++draft/atomics.order#2). But that is precisely what we
are unable to prove, so we are still in the same pickle.
Even if all accesses are seq_cst, I still can't prove that it's
deadlock-free. The loads and stores participate in the total order S, but
it is not made clear whether notify and unblock do so. So S could have
both the pre-block and post-unblock loads preceding the store, in which
case both loads return false, and the main thread blocks a second time.
I have not been able to show that this would contradict any of the
ordering rules.
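To make the failure concrete, here is the same program with everything
seq_cst, annotated with a total order S that I have not been able to rule
out. The comments are my reading of the situation, not anything the
standard says:

#include <atomic>
#include <thread>

std::atomic<bool> b{false};

void thr1() {
    b.store(true, std::memory_order_seq_cst);  // could come last in S
    b.notify_one();  // not clearly a participant in S at all
}

int main() {
    std::thread t(thr1);
    // wait() loads b, blocks if it sees false, and re-loads after each
    // wakeup.  Nothing I can find forbids S = (load #1 == false,
    // load #2 == false, store), i.e. both the pre-block and the
    // post-unblock loads preceding the store, after which main blocks
    // again with no further notify coming.
    b.wait(false, std::memory_order_seq_cst);
    t.join();
    return 0;
}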
It could even fail under my "perverse" implementation upthread; nothing
prevents the relaxed __counter.fetch_add() from being hoisted above the
release b.store(), nor the relaxed __counter.load() from being sunk below
an acquire b.load(). The barriers go the wrong way. And adding seq_cst
doesn't help either: it gives us more guarantees about how seq_cst
operations are ordered with respect to each other, but buys us nothing
about how they are ordered with respect to weaker operations, which is the
issue here.
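For reference, a rough sketch of the shape I have in mind (illustrative
names, futex plumbing elided; take the details loosely):

#include <atomic>

std::atomic<unsigned> __counter{0};  // global wakeup generation counter

void perverse_notify(std::atomic<bool>& a) {
    // Relaxed RMW: nothing prevents it from being hoisted above a
    // release store to 'a' that the caller performed just before the
    // notify, since a release only orders *earlier* operations.
    __counter.fetch_add(1, std::memory_order_relaxed);
    // ... wake any blocked waiters (e.g. futex wake) ...
}

void perverse_wait(std::atomic<bool>& a, bool old) {
    for (;;) {
        // Relaxed load: nothing prevents it from being sunk below the
        // acquire load of 'a' just after it, since an acquire only
        // orders *later* operations.
        unsigned gen = __counter.load(std::memory_order_relaxed);
        if (a.load(std::memory_order_acquire) != old)
            return;
        // ... block until __counter != gen (e.g. futex wait) ...
        (void)gen;
    }
}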
Adding synchronizes-with wouldn't normally require the kernel or anyone
else to flush caches. As I understand it, modern multiprocessor systems
have coherent caches anyway, and the issues with memory ordering come from
features like per-core store buffers, which delay the commit to L1 cache.
A release barrier just has to wait for the store buffer to drain, which is
no worse than the corresponding cache misses on an in-order machine, and
an acquire barrier just has to inhibit out-of-order speculative loads from
passing it. So they are much cheaper than cache flushes or IPI or
anything like that.
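For instance, the usual mappings (modulo compiler codegen details) involve
no cache maintenance at all:

#include <atomic>

std::atomic<int> flag{0}, data{0};

void publish() {
    data.store(42, std::memory_order_relaxed);
    // Release store: 'stlr' on AArch64, a plain 'mov' on x86-64 (TSO
    // already provides release ordering).  It only keeps the earlier
    // store from becoming visible after this one; no flush, no IPI.
    flag.store(1, std::memory_order_release);
}

int consume() {
    // Acquire load: 'ldar' on AArch64, a plain 'mov' on x86-64.  It
    // only keeps the later load from being speculated ahead of it.
    if (flag.load(std::memory_order_acquire) == 1)
        return data.load(std::memory_order_relaxed);
    return -1;  // flag not yet set
}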
--
Nate Eldredge
nate_at_[hidden]
Received on 2022-07-25 00:59:50