On Mon, Feb 10, 2025 at 1:57 AM Ivan Matek via Std-Proposals <std-proposals@lists.isocpp.org> wrote:


On Fri, Feb 7, 2025 at 5:48 PM Arthur O'Dwyer <arthur.j.odwyer@gmail.com> wrote:
My own benchmark for SG14's `unstable_remove_if` (here) compares the algorithm on a vector(30'000) of array<int,16>. Its reported timings on my MacBook just now are 620µs for unstable_remove_if, 650µs for stable remove_if.

On my intel cpu with clang and libstdc++ numbers are a bit better for unstable, i.e. speedup is larger, but similar story
UnstableRemoveIf/iterations:20000     243797 ns       243755 ns        20000
StdPartition/iterations:20000         290692 ns       290617 ns        20000
StdRemoveIf/iterations:20000          314727 ns       314668 ns        20000
 
Numbers itself are not comparable since I prefer fixed number of iterations, but in general unstable is fastest.

But what I wanted is to show that if one is creative(i.e. I looked for largest realistic setup where I expected unstable to shine) with test setup larger speedups can be observed. 
I have picked strings of maximum SSO size to make moves expensive and then things look quite good for unstable.
unstable_remove_if/iterations:100000       92.3 ns         92.3 ns       100000
std_partition/iterations:100000             106 ns          106 ns       100000
std_remove_if/iterations:100000             142 ns          142 ns       100000


When we make strings allocate speedups are as expected not that impressive because advantage over partition is smaller.  Without profiling this seems expected since advantage over partition is that we do not need to keep "removed" item fine as partition does and
that makes number of operations we need to do when operating in SSO buffer much smaller. When we are just rewiring heap pointers difference between move and swap is probably not that large.
unstable_remove_if/iterations:100000       95.5 ns         95.5 ns       100000
std_partition/iterations:100000             100 ns          100 ns       100000
std_remove_if/iterations:100000             135 ns          135 ns       100000


What is interesting that when we add some fraction of heap allocations in test data remove_if seems to suffer a lot.  I presumed this is because of branch prediction, and basic checks with perf show indeed branch misprediction differences, but unclear if that is cause of performance differences.
unstable_remove_if/iterations:100000       96.6 ns         96.6 ns       100000
std_partition/iterations:100000             110 ns          110 ns       100000
std_remove_if/iterations:100000             172 ns          172 ns       100000

What about the indirect benefits? From how I understand the proposal the unstable_* functions do a whole lot less cache line clearing and memcpy'ing; this should have a general benefit on software beyond what you can measure in a microbenchmark.