Date: Sun, 24 Jul 2022 17:59:46 -0700 (PDT)
On Tue, 19 Jul 2022, Ryan Nicholl via Std-Discussion wrote:
> Interjection here, but isn't the point of atomic wait/notify to
> implement things like condition variables? In that case, a read-only
> operation in relaxed ordering would not be required to obtain any value
> written before notify_one unless notify_one has an implied release. I'm
> unsure that the implied "happens before" ordering for notify exists
> especially given spurious wakeups could cause it to return before write
> and a write could happen in relaxed mode which implies no ordering. I
> doubt this is an issue in practice since the kernel will probably flush
> caches on context switch triggering a seq_cst-like behavior, but it's
> good to think about. I would say notify does NOT order anything and you
> "must" use acquire/release on the atomic to get ordering. Anything else
> could require a lot of overhead in the implementation if the kernel
> doesn't flush caches. My assumption is that wait/notify are intended as
> low level wrappers that map directly to SYS_futex, waitv,
> WaitForMultipleObjectsEx and similar system calls, so any kind of
> sequencing would make the implementation tricky if missing in any OS.
Under this interpretation, I believe it would actually be impossible to
ever use wait/notify without risking deadlock, regardless of the memory
order you select. I used relaxed ordering in my example for simplicity,
but the problem persists with stronger orders.
If we do:
#include <atomic>
#include <thread>

std::atomic<bool> b{false};

void thr1() {
    b.store(true, std::memory_order_release);
    b.notify_one();
}

int main() {
    std::thread t(thr1);
    b.wait(false, std::memory_order_acquire);
    t.join();
    return 0;
}
the only way to get synchronization between the threads is by having the
load in wait() take its value from the store() in thr1
(https://eel.is/c++draft/atomics.order#2). But that is precisely what we
are unable to prove, so we are still in the same pickle.
Even if all accesses are seq_cst, I still can't prove that it's
deadlock-free. The loads and stores participate in the total order S, but
it is not made clear whether notify and unblock do so. So S could have
both the pre-block and post-unblock loads preceding the store, in which
case both loads return false, and the main thread blocks a second time.
I have not been able to show that this would contradict any of the
ordering rules.
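To make the failure concrete, here is the same program with everything
seq_cst, annotated with a total order S that I have not been able to rule
out. The comments are my reading of the situation, not anything the
standard says:

#include <atomic>
#include <thread>

std::atomic<bool> b{false};

void thr1() {
    b.store(true, std::memory_order_seq_cst);  // could come last in S
    b.notify_one();  // not clearly a participant in S at all
}

int main() {
    std::thread t(thr1);
    // wait() loads b, blocks if it sees false, and re-loads after each
    // wakeup.  Nothing I can find forbids S = (load #1 == false,
    // load #2 == false, store), i.e. both the pre-block and the
    // post-unblock loads preceding the store, after which main blocks
    // again with no further notify coming.
    b.wait(false, std::memory_order_seq_cst);
    t.join();
    return 0;
}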
It could even fail under my "perverse" implementation upthread; nothing
prevents the relaxed __counter.fetch_add() from being hoisted above the
release b.store(), nor the relaxed __counter.load() from being sunk below
an acquire b.load(). The barriers go the wrong way. And adding seq_cst
doesn't help either: it gives us more guarantees about how seq_cst
operations are ordered with respect to each other, but buys us nothing
about how they are ordered with respect to weaker operations, which is the
issue here.
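For reference, a rough sketch of the shape I have in mind (illustrative
names, futex plumbing elided; take the details loosely):

#include <atomic>

std::atomic<unsigned> __counter{0};  // global wakeup generation counter

void perverse_notify(std::atomic<bool>& a) {
    // Relaxed RMW: nothing prevents it from being hoisted above a
    // release store to 'a' that the caller performed just before the
    // notify, since a release only orders *earlier* operations.
    __counter.fetch_add(1, std::memory_order_relaxed);
    // ... wake any blocked waiters (e.g. futex wake) ...
}

void perverse_wait(std::atomic<bool>& a, bool old) {
    for (;;) {
        // Relaxed load: nothing prevents it from being sunk below the
        // acquire load of 'a' just after it, since an acquire only
        // orders *later* operations.
        unsigned gen = __counter.load(std::memory_order_relaxed);
        if (a.load(std::memory_order_acquire) != old)
            return;
        // ... block until __counter != gen (e.g. futex wait) ...
        (void)gen;
    }
}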
Adding synchronizes-with wouldn't normally require the kernel or anyone
else to flush caches. As I understand it, modern multiprocessor systems
have coherent caches anyway, and the issues with memory ordering come from
features like per-core store buffers, which delay the commit to L1 cache.
A release barrier just has to wait for the store buffer to drain, which is
no worse than the corresponding cache misses on an in-order machine, and
an acquire barrier just has to inhibit out-of-order speculative loads from
passing it. So they are much cheaper than cache flushes or IPI or
anything like that.
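For instance, the usual mappings (modulo compiler codegen details) involve
no cache maintenance at all:

#include <atomic>

std::atomic<int> flag{0}, data{0};

void publish() {
    data.store(42, std::memory_order_relaxed);
    // Release store: 'stlr' on AArch64, a plain 'mov' on x86-64 (TSO
    // already provides release ordering).  It only keeps the earlier
    // store from becoming visible after this one; no flush, no IPI.
    flag.store(1, std::memory_order_release);
}

int consume() {
    // Acquire load: 'ldar' on AArch64, a plain 'mov' on x86-64.  It
    // only keeps the later load from being speculated ahead of it.
    if (flag.load(std::memory_order_acquire) == 1)
        return data.load(std::memory_order_relaxed);
    return -1;  // flag not yet set
}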
--
Nate Eldredge
nate_at_[hidden]
Received on 2022-07-25 00:59:50