If you assign an atomic<shared_ptr<T>> to a shared_ptr<T>, a complex copy process takes place. A lock-free implementation is possible, but it involves a series of atomic operations and thereby expensive cache-line invalidation in other CPUs' caches. So if you have an RCU-like pattern where the shared atomic<shared_ptr<T>> is rarely updated, copying is rather expensive.
So there could be an overload of the assignment operator of shared_ptr<T> which first compares its own pointer against the pointer of the atomic<shared_ptr<T>> and, if they're the same, simply does nothing. If the participating threads each kept a thread_local copy of the atomic<shared_ptr<T>>, RCU-like patterns would become more efficient than with userspace-RCU.
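To make the thread_local idea concrete with today's library, here is a minimal sketch that swaps in a different technique: a version counter next to a mutex-protected shared_ptr, since std::atomic<shared_ptr<T>> offers no way to peek at its pointer without a full load. The names `versioned_cell`, `cache`, `store`, and `refresh` are hypothetical and not part of any standard or proposed API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a rarely updated shared pointer (not a standard type).
template <class T>
class versioned_cell {
public:
    explicit versioned_cell(std::shared_ptr<const T> initial)
        : ptr_(std::move(initial)) {}

    // Per-reader snapshot; intended to live in thread_local storage.
    struct cache {
        std::shared_ptr<const T> ptr;
        std::uint64_t version = 0;  // 0 means "nothing cached yet"
    };

    // Writer side: publish a new object and bump the version under the lock.
    void store(std::shared_ptr<const T> next) {
        std::lock_guard<std::mutex> lock(mutex_);
        ptr_.swap(next);  // the old object is released after the lock is dropped
        version_.fetch_add(1, std::memory_order_release);
    }

    // Reader side: returns true if the snapshot had to be refreshed.
    bool refresh(cache& c) const {
        // Fast path: a single acquire load of a rarely written counter,
        // with no reference-count traffic and no lock.
        if (version_.load(std::memory_order_acquire) == c.version) return false;

        // Slow path: copy pointer and version together under the lock.
        std::shared_ptr<const T> fresh;
        std::uint64_t v;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            fresh = ptr_;
            v = version_.load(std::memory_order_relaxed);
        }
        c.ptr = std::move(fresh);
        c.version = v;
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const T> ptr_;
    std::atomic<std::uint64_t> version_{1};  // starts at 1 so an empty cache always refreshes
};
```

Each reader thread would keep one `cache` object (for example as a thread_local variable) and call `refresh` before using `c.ptr`. Whether this actually beats a plain load, or userspace-RCU, is something only a benchmark can say.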
Hmm. Basically you're saying that

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept; // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    dest = source.load(); // THIS LINE

is slower than necessary whenever `dest == source.load()`. If that's a common situation, then we could speed up the common case with a new method such as:

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept; // memory order omitted for simplicity
        void load_into(shared_ptr<T>& dest) const noexcept; // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    source.load_into(dest); // THIS LINE

This strikes me as the sort of thing that must already be possible, somehow, but I admit I don't see how to achieve it without a new method.
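For what it's worth, the intended semantics of `load_into` can be sketched on a toy type. This is a mutex-based stand-in, not std::atomic<shared_ptr<T>> and not lock-free, so it shows only the contract, not the performance. The class name `locked_shared_ptr` is hypothetical, and `load_into` returns bool here (rather than void as declared above) purely so the example can be checked.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical mutex-based stand-in for atomic<shared_ptr<T>>; illustrative only.
template <class T>
class locked_shared_ptr {
public:
    explicit locked_shared_ptr(std::shared_ptr<T> p = nullptr)
        : ptr_(std::move(p)) {}

    void store(std::shared_ptr<T> next) {
        std::lock_guard<std::mutex> lock(mutex_);
        ptr_.swap(next);
    }

    std::shared_ptr<T> load() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ptr_;  // always copies, so always touches the refcount
    }

    // Returns true if dest was changed, false if it already matched.
    bool load_into(std::shared_ptr<T>& dest) const {
        std::shared_ptr<T> fresh;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            const bool same_pointer = dest.get() == ptr_.get();
            const bool same_owner =
                !dest.owner_before(ptr_) && !ptr_.owner_before(dest);
            if (same_pointer && same_owner) return false;  // no refcount traffic
            fresh = ptr_;
        }
        dest = std::move(fresh);  // the old value of dest is released outside the lock
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<T> ptr_;
};
```

The point of making it a member is that only the implementation can compare against the stored pointer before deciding whether to copy. A free function built on `load()` has already paid for the copy by the time it can compare.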
I also observe that in theory the same optimization could be useful for `shared_ptr::operator=(const shared_ptr&)` itself: if the dest is already equal (and owner_equal too) to the source, then we don't need to touch its refcount. OTOH, maybe the cost of the extra branch (and the occasional misprediction) is worse than the cost of a redundant atomic increment/decrement cycle. In practice I observe that libstdc++ does the optimization and libc++ doesn't.
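In user code, the check described above could look roughly like the following. The helper name `assign_if_different` is made up for illustration; it spells owner equality with `owner_before` so that it compiles with pre-C++26 libraries.

```cpp
#include <cassert>
#include <memory>

// Hypothetical helper, not part of any standard library.
// Returns true if an assignment was performed, false if it was skipped.
template <class T>
bool assign_if_different(std::shared_ptr<T>& dest, const std::shared_ptr<T>& src) {
    const bool same_pointer = dest.get() == src.get();
    // Two shared_ptrs share ownership when neither is owner_before the other.
    const bool same_owner = !dest.owner_before(src) && !src.owner_before(dest);
    if (same_pointer && same_owner) return false;  // skip the refcount round trip
    dest = src;
    return true;
}
```

Both conditions matter: with aliasing constructors, two shared_ptrs can share a control block while pointing at different objects, or the reverse. Whether the extra branch pays off would still need measuring on the target library.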
Have you tried implementing your idea in whatever atomic<shared_ptr<T>> library you currently use? Does it actually produce a performance boost?
my $.02,
–Arthur