ISOCPP Std Discussion List: Re: Generalized atomic updates

From: Nate Eldredge <nate_at_[hidden]>
Date: Wed, 8 May 2024 12:17:51 -0600 (MDT)

On Wed, 8 May 2024, Thiago Macieira wrote:

> On Tuesday 7 May 2024 22:19:57 GMT-7 Nate Eldredge via Std-Discussion wrote:
>> Along the lines of std::atomic<T>::fetch_add() and friends, I have been
>> wondering about the possibility of an interface that would let you
>> atomically apply an *arbitrary* function to an atomic object. I'm
>> wondering if such an interface has ever been considered for inclusion in
>> C++.
>
> Hello Nate
>
> Not that I've seen it, but I did plan on submitting a paper on "atomic_mutate"
> that would generalise fetch_xxx with some extra operations and operands. Some
> hardware can do them more efficiently than others.
>
> But I haven't even begun writing the draft yet. I wrote the prototype API for
> it in https://codereview.qt-project.org/c/qt/qtbase/+/489704 and
> https://codereview.qt-project.org/c/qt/qtbase/+/490016

This is very interesting, thanks for sharing it! I'll try to take a
closer look sometime and contact you if I have any thoughts.

>> std::atomic<int> x;
>> int f(int);
>> x.fetch_update(f);
>>
>> would atomically perform the equivalent of
>>
>> x.store(f(x.load());
>>
>> Currently, one would normally accomplish this with a compare-exchange
>> loop:
>>
>> int tmp = x.load();
>> while (!x.compare_exchange_weak(tmp, f(tmp))) { }
>>
>> and this would be a valid implementation. However, many architectures
>> have load-link/store-conditional instructions (e.g. ARM64 with LDRX/STRX)
>> and could implement this more efficiently by executing f() in between the
>> LL and SC. So in effect, fetch_update would expose the LL/SC
>> functionality for use at the level of C++. It could also be implemented
>> with transactional memory where available.
>
> That's interesting, but f needs to be extremely simple, otherwise that
> machinery can't be used and the store-conditional will always fail. On some
> architectures (and my LL/SC is very rusty), you cannot place function calls
> between the two, so f must be always inlined.

Indeed. My idea was that if the SC didn't succeed within the first few
attempts, that we fall back to a compare-exchange loop. Of course, if
some other thread is hammering on the object, then the compare-exchange
might never succeed either, but that's already something the programmer
has to account for.

The compiler could also have some heuristics to decide statically if f is
"too complex" (including if it can't be inlined) and go directly to a
compare-exchange loop in that case.

> What's more, it would be difficult for the compiler to determine what you meant
> by simply analysing the code, in order to implement things like x86 XADD
> (fetch_add). Not impossible, just difficult. But impossible for the Standard
> Library to do so.

Sure. There would still be a place for all the fetch_xxx functions, to
expose the dedicated machine instructions directly when they exist. But
there would at least be an opportunity for compilers to identify places
where those instructions could be used, which they can't really do with a
compare-exchange loop. And when architectures introduce new atomic
instructions, this could be worthwhile as a "stopgap" until the Standard
gets around to defining the corresponding fetch_xxx. (Case in point:
ARMv8.1 introduced atomic integer min and max instructions back in 2017 or
so, and fetch_min / fetch_max are only slated for inclusion in C++26 - a
lag of nine years.)

> That's why in my implementation I propose that, instead of
> passing a lambda mutator, you pass a description of the operation you want
> done. You do that either via enums, as in:
>
> x.mutate(QAtomicModify::Xor, 0x1000);
>
> or in the nicer shorthand
>
> x.mutate(_A ^= 0x1000);

Well, we already have x.fetch_xor(0x1000), but I see in your code that
what's new is the ability to specify a predicate (like _A != 0) to
determine whether the store should be done at all. So it's a little bit
along the same lines as N3696's priority_update; the difference being that
you have more choice of arithmetic operations (N3696 essentially only lets
you do _A = x) but less choice of predicates (they allowed arbitrary
lambdas).

This certainly seems interesting and useful. And for instance, it would
solve the fetch_min problem I mentioned above, since you could detect
and special-case x.mutate(_A > b, _A = b).

In the code you linked, I only saw a generic compare-exchange
implementation. Have you tried at all to create any machine-specific
implementations, using inline assembly for LL/SC and/or RMW instructions
where available? I'd be curious what kind of code the compiler can
produce. If you haven't, I might try my hand at it, just for fun.

Still, though, I feel like even when you enlarge the "menu" of available
operations like this, there's always going to be things left out. For
instance, an LL/SC machine can trivially do an atomic shift, which we
currently have no way to do in C++ short of compare-exchange. Or imagine
some sort of multi-threaded Collatz code that wants to do an atomic 3*x+1.

(Minor digression: it's annoyed me for some time that C++ doesn't have
fetch_xxx for all the built-in arithmetic operators, which would at least
solve atomic shift. C allows you to apply any compound assignment
operator to an _Atomic object, so all compilers with C front-ends must
already have them implemented. The advantage of fetch_xxx would be that
you can recover the old value, and specify the memory order.)

>> For atomic types that are not lock-free, fetch_update could improve on
>> compare-exchange by simply executing f() with the lock held. And even on
>> systems where compare-exchange is the only possible implementation, it
>> would be simpler and less error-prone for the programmer to write
>> x.fetch_update(f) instead of writing the compare-exchange loop every time.
>
> Executing user code with the lock held is usually a big red flag. It goes back
> to the same problem of "the lambda must be really really simple" I mentioned
> above. I prefer to have the comparison and mutating operations be described
> instead of being passed as a lambda.

That's a fair point. To avoid deadlocks, I suppose you could require that
f must not do anything that could block: no taking mutexes, no accessing
any non-lock-free atomics, etc. But maybe there are further pitfalls that
I haven't thought of.

Anyway, thanks again!

-- 
Nate Eldredge
nate_at_[hidden]

Received on 2024-05-08 18:17:55

std-discussion