
Re: [std-proposals] Improvement for std::shared_ptr<T>

From: Arthur O'Dwyer <arthur.j.odwyer_at_[hidden]>
Date: Sun, 25 Aug 2024 11:45:24 -0400
On Sun, Aug 25, 2024 at 11:22 AM Oliver Schädlich via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> If you assign an atomic<shared_ptr<T>> to a shared_ptr<T>, a complex
> copy process takes place.
> A lock-free implementation is possible, but it involves a series of
> atomic operations and thereby expensive cacheline invalidations in
> other CPUs' caches.
> So if you have an RCU-like pattern where the shared
> atomic<shared_ptr<T>> is rarely updated, copying is rather expensive.
> So there could be an overload of the assignment operator of
> shared_ptr<T> which internally first compares its own pointer against
> the pointer of the atomic<shared_ptr<T>> and, if they're the same,
> simply does nothing. If the participating threads kept a thread_local
> copy of the atomic<shared_ptr<T>>, RCU-like patterns would become
> more efficient than with userspace RCU.
>

Hmm. Basically you're saying that

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept; // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    dest = source.load(); // THIS LINE

is slower than necessary whenever `dest == source.load()`. If that's a
common situation, then we could speed up the common case with a new method
such as:

    template<class T>
    struct atomic<shared_ptr<T>> {
        shared_ptr<T> load() const noexcept; // memory order omitted for simplicity
        void load_into(shared_ptr<T>& dest) const noexcept; // memory order omitted for simplicity
    };

    atomic<shared_ptr<T>> source;
    shared_ptr<T> dest;
    source.load_into(dest); // THIS LINE

This strikes me as the sort of thing that must already be possible,
somehow, but I admit I don't see how to achieve it without a new method.
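
FWIW, the closest portable approximation I can think of today is something
like the sketch below (`load_if_changed` is just a name made up for
illustration, not anything standard). It avoids touching `dest`'s refcount
when the owners already match, but the `load()` itself still does a full
increment/decrement on the shared control block, which is exactly the traffic
a real `load_into` could avoid:

    #include <atomic>
    #include <memory>
    #include <utility>

    // Sketch only: skip the store into `dest` when both already share
    // ownership. The load() still pays for one refcount round trip.
    template<class T>
    void load_if_changed(const std::atomic<std::shared_ptr<T>>& source,
                         std::shared_ptr<T>& dest)
    {
        std::shared_ptr<T> current = source.load();
        bool same_owner = !dest.owner_before(current)
                       && !current.owner_before(dest);
        if (!same_owner)
            dest = std::move(current);
    }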

I also observe that in theory the same optimization could be useful for
`shared_ptr::operator=(const shared_ptr&)` itself: if the dest is already
equal (and owner_equal too) to the source, then we don't need to mess with
its refcount. OTOH, maybe the cost of a mispredicted branch is worse than the
cost of a redundant atomic increment/decrement cycle. In practice I observe
that libstdc++ does the optimization and libc++ doesn't.
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h#L1079-L1092
https://github.com/llvm/llvm-project/blob/main/libcxx/include/__memory/shared_ptr.h#L671-L674
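
For readers who don't want to chase the links, the gist of the libstdc++
check is roughly the following toy (all names are invented for illustration,
and the other special members are omitted):

    #include <atomic>

    // Toy refcounted handle, only to show the shape of the check libstdc++
    // makes in its shared_ptr assignment path.
    struct toy_control_block {
        std::atomic<long> refs{1};
        void add_ref() noexcept { refs.fetch_add(1, std::memory_order_relaxed); }
        void release() noexcept {
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete this;
        }
    };

    struct toy_shared {
        toy_control_block* ctrl_ = nullptr;

        toy_shared& operator=(const toy_shared& r) noexcept {
            toy_control_block* tmp = r.ctrl_;
            if (tmp != ctrl_) {              // when both sides already share the
                if (tmp)   tmp->add_ref();   // same control block, skip the
                if (ctrl_) ctrl_->release(); // increment/decrement cycle entirely
                ctrl_ = tmp;                 // (libc++ does the refcount work
            }                                // unconditionally)
            return *this;
        }
    };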

Have you tried implementing your idea in whatever atomic<shared_ptr<T>>
library you currently use? Does it actually produce a performance boost?
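
For concreteness, my reading of the reader side of your RCU-like pattern is
something like this (config_type, g_config, and the refresh point are all
placeholder names, not anything from your code):

    #include <atomic>
    #include <memory>

    struct config_type { /* rarely-updated shared state */ };

    std::atomic<std::shared_ptr<config_type>> g_config{
        std::make_shared<config_type>()};

    void reader_iteration() {
        // Each reader thread caches its own copy of the shared pointer.
        thread_local std::shared_ptr<config_type> cached;

        // Today this refresh pays the refcount traffic on every call, even
        // when g_config hasn't changed; that's the case a load_into (or an
        // owner check in operator=) would make nearly free.
        cached = g_config.load();

        // ... read through *cached ...
    }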

my $.02,
–Arthur

Received on 2024-08-25 15:45:38