On Sun, Aug 25, 2024 at 11:22 AM Oliver Schädlich via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

If you assign an atomic<shared_ptr<T>> to a shared_ptr<T>, a complex copy process takes place.
A lock-free implementation is possible, but it involves a series of atomic operations and thereby
expensive cacheline invalidation in other CPUs' caches.
So if you have an RCU-like pattern where the shared atomic<shared_ptr<T>> is rarely updated,
copying is rather expensive. So there could be an overload of the assignment operator of
shared_ptr<T> which internally first compares its own pointer against the pointer of
the atomic<shared_ptr<T>> and, if they're the same, simply does nothing. If the participating
threads each kept a thread_local copy of the atomic<shared_ptr<T>>, RCU-like patterns
would become more efficient than with userspace RCU.


Hmm. Basically you're saying that

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept;  // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    dest = source.load();  // THIS LINE

is slower than necessary whenever `dest == source.load()`. If that's a common situation, then we could speed up the common case with a new method such as:

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept;  // memory order omitted for simplicity
        void load_into(shared_ptr<T>& dest) const noexcept;  // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    source.load_into(dest);  // THIS LINE

This strikes me as the sort of thing that must already be possible, somehow, but I admit I don't see how to achieve it without a new method.

I also observe that in theory the same optimization could be useful for `shared_ptr::operator=(const shared_ptr&)` itself: if the dest is already equal (and owner_equal too) to the source, then we don't need to touch its refcount at all. OTOH, maybe the cost of a possible branch misprediction is worse than the cost of a redundant atomic increment/decrement cycle. In practice I observe that libstdc++ does this optimization and libc++ doesn't.
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h#L1079-L1092
https://github.com/llvm/llvm-project/blob/main/libcxx/include/__memory/shared_ptr.h#L671-L674

Have you tried implementing your idea in whatever atomic<shared_ptr<T>> library you currently use? Does it actually produce a performance boost?

my $.02,
–Arthur