On Mon, Feb 10, 2025 at 1:57 AM Ivan Matek via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

On Fri, Feb 7, 2025 at 5:48 PM Arthur O'Dwyer <arthur.j.odwyer@gmail.com> wrote:
My own benchmark for SG14's `unstable_remove_if` (here) compares the algorithm on a vector(30'000) of array<int,16>. Its reported timings on my MacBook just now are 620µs for unstable_remove_if, 650µs for stable remove_if.

On my intel cpu with clang and libstdc++ numbers are a bit better for unstable, i.e. speedup is larger, but similar story
UnstableRemoveIf/iterations:20000 243797 ns 243755 ns 20000
StdPartition/iterations:20000 290692 ns 290617 ns 20000
StdRemoveIf/iterations:20000 314727 ns 314668 ns 20000

Numbers itself are not comparable since I prefer fixed number of iterations, but in general unstable is fastest.

But what I wanted is to show that if one is creative(i.e. I looked for largest realistic setup where I expected unstable to shine) with test setup larger speedups can be observed.
I have picked strings of maximum SSO size to make moves expensive and then things look quite good for unstable.
unstable_remove_if/iterations:100000 92.3 ns 92.3 ns 100000
std_partition/iterations:100000 106 ns 106 ns 100000
std_remove_if/iterations:100000 142 ns 142 ns 100000

When we make strings allocate speedups are as expected not that impressive because advantage over partition is smaller. Without profiling this seems expected since advantage over partition is that we do not need to keep "removed" item fine as partition does and
that makes number of operations we need to do when operating in SSO buffer much smaller. When we are just rewiring heap pointers difference between move and swap is probably not that large.
unstable_remove_if/iterations:100000 95.5 ns 95.5 ns 100000
std_partition/iterations:100000 100 ns 100 ns 100000
std_remove_if/iterations:100000 135 ns 135 ns 100000

What is interesting that when we add some fraction of heap allocations in test data remove_if seems to suffer a lot. I presumed this is because of branch prediction, and basic checks with perf show indeed branch misprediction differences, but unclear if that is cause of performance differences.
unstable_remove_if/iterations:100000 96.6 ns 96.6 ns 100000
std_partition/iterations:100000 110 ns 110 ns 100000
std_remove_if/iterations:100000 172 ns 172 ns 100000

What about the indirect benefits? From how I understand the proposal the unstable_* functions do a whole lot less cache line clearing and memcpy'ing; this should have a general benefit on software beyond what you can measure in a microbenchmark.