ISOCPP Std Discussion List: Re: Synchronization of atomic notify and wait

From: Nate Eldredge <nate_at_[hidden]>
Date: Sun, 24 Jul 2022 18:27:17 -0700 (PDT)

On Wed, 20 Jul 2022, Marcin Jaczewski via Std-Discussion wrote:

> How could it be possible that after unblocking it loads old value?
> load is atomic and value is already changed as it "happens before" a
> call to `notify`.

If you look back at the "perverse" implementation I posted before, where
"blocking" is just a spin loop and "unblocking" is exiting that loop, the
issue boils down to the fact that, when one relaxed store happens-before
another, it does not follow that they will be observed by another thread
in the same order. Perhaps this is the source of your confusion?

So, if we do a typical store-store litmus test:

thr1:
a.store(true, std::memory_order_relaxed);
b.store(true, std::memory_order_relaxed);

thr2:
b_tmp = b.load();
a_tmp = a.load();

it is entirely possible that we end up with b_tmp == true and a_tmp ==
false.

It is perfectly true here that a.store() happens-before b.store(), by
sequencing. But this by itself does not help us. In order to conclude
anything about the value of a_tmp, we have to show that a.store()
happens-before a.load(), and then use write-read coherence
(https://eel.is/c++draft/intro.races#18). That's exactly what we can't
do. The only way for a happens-before relation to cross thread boundaries
is synchronizes-with, which requires us to have an acquire load that takes
its value from a release store. We have no release stores in this
program, so nothing synchronizes with anything else, and we cannot prove
any happens-before between the stores in thr1 and the loads in thr2.

Allowing behavior like this is the entire point of relaxed ordering, and
you can observe it happening on mainstream weakly-ordered architectures
like ARM64. On modern desktop/server machines, this would be implemented
without anything so demanding as the cache flushes and IPIs in language
lawyer's description. All you need is an out-of-order store buffer.

On such a system, the cores have coherent L1 cache via something like the
MESI protocol, so once a store is committed to L1 cache, it is globally
visible. But the execution (retirement) of a store instruction does not
force an immediate commit to L1 cache. Rather, the store goes into a
core-local store buffer, from which stores are commited in arbitrary order
as the core acquires exclusivity (MESI E state) on their respective cache
lines. So if by luck, the cache line for b is already held by the thr1
core, but the cache line for a is owned by another core, or is missing
from L1 cache and must be read from L2/L3/DRAM, then the store to b may
very well commit to L1 and become globally visible before the store to a.

On such a system, a release store would insert a barrier in the store
buffer, so that the commitment of the store to b is delayed, if necessary,
until all preceding stores have committed. To implement a more
strongly-ordered architecture like x86, where all stores are release, you
would simply make the store buffer a strict FIFO queue where entries never
leapfrog each other.

-- 
Nate Eldredge
nate_at_[hidden]

Received on 2022-07-25 01:27:22

std-discussion