C++ Logo


Advanced search

Re: Synchronization of atomic notify and wait

From: Ryan Nicholl <exaeta_at_[hidden]>
Date: Mon, 25 Jul 2022 04:37:46 +0000
Wait must observe the state changing from expected to something else to return. It cannot deadlock because the sleep is combined with the check through special system calls like SYS_futex. Kernel either A. guarantees you are able to receive notifications on the wait list THEN checks the value for any changes, or B. does both steps of A in an atomic manner, such that no race is possible. One problem i do see is that Linux cannot implement wait correctly on 64 bit values without futex waitv which is a pretty recent add-on, but I think libatomic authors will find a workaround (I assume), though it might be slightly slower, such as a global lock?

Ryan Nicholl
Sent from Proton Mail mobile
-------- Original Message --------
On Jul 24, 2022, 20:59, Nate Eldredge wrote:
> On Tue, 19 Jul 2022, Ryan Nicholl via Std-Discussion wrote: > Interjection here, but isn't the point of atomic wait/notify to > implement things like condition variables? In that case, a read-only > operation in relaxed ordering would not be required to obtain any value > written before notify_one unless notify_one has an implied release. I'm > unsure that the implied "happens before" ordering for notify exists > especially given spurious wakeups could cause it to return before write > and a write could happen in relaxed mode which implies no ordering. I > doubt this is an issue in practice since the kernel will probably flush > caches on context switch triggering a seq_cst-like behavior, but it's > good to think about. I would say notify does NOT order anything and you > "must" use acquire/release on the atomic to get ordering. Anything else > could require a lot of overhead in the implementation if the kernel > doesn't flush caches. My assumption is that wait/notify are intended as > low level wrappers that map directly to SYS_futex, waitv, > WaitForMultipleObjectsEx and similar system calls, so any kind of > sequencing would make the implementation tricky if missing in any OS. Under this interpretation, I believe it would actually be impossible to ever use wait/notify without risking deadlock, regardless of the memory order you select. I used relaxed ordering in my example for simplicity, but the problem persists with stronger orders. If we do: void thr1() { b.store(true, std::memory_order_release); b.notify_one(); } int main() { std::thread t(thr1); b.wait(false, std::memory_order_acquire); t.join(); return 0; } the only way to get synchronization between the threads is by having the load in wait() take its value from the store() in thr1 (https://eel.is/c++draft/atomics.order#2). But that is precisely what we are unable to prove, so we are still in the same pickle. Even if all accesses are seq_cst, I still can't prove that it's deadlock-free. The loads and stores participate in the total order S, but it is not made clear whether notify and unblock do so. So S could have both the pre-block and post-unblock loads preceding the store, in which case both loads return false, and the main thread blocks a second time. I have not been able to show that this would contradict any of the ordering rules. It could even fail under my "perverse" implementation upthread; nothing prevents the relaxed __counter.fetch_add() from being hoisted above the release b.store(), nor the relaxed __counter.load() from being sunk below an acquire b.load(). The barriers go the wrong way. And adding seq_cst doesn't help either: it gives us more guarantees about how seq_cst operations are ordered with respect to each other, but buys us nothing about how they are ordered with respect to weaker operations, which is the issue here. Adding synchronizes-with wouldn't normally require the kernel or anyone else to flush caches. As I understand it, modern multiprocessor systems have coherent cache anyway, and the issues with memory ordering come from features like per-core store buffers, which delay the commit to L1 cache. A release barrier just has to wait for the store buffer to drain, which is no worse than the corresponding cache misses on an in-order machine, and an acquire barrier just has to inhibit out-of-order speculative loads from passing it. So they are much cheaper than cache flushes or IPI or anything like that. -- Nate Eldredge nate_at_[hidden]

Received on 2022-07-25 04:37:52