Date: Mon, 10 Feb 2025 01:57:21 +0100
On Fri, Feb 7, 2025 at 5:48 PM Arthur O'Dwyer <arthur.j.odwyer_at_[hidden]>
wrote:
> My own benchmark for SG14's `unstable_remove_if` (here
> <https://github.com/Quuxplusone/SG14/blob/b509fc470a30e1d19820487f899f987c12653a68/benchmarks/unstable_remove_bench.cpp#L20-L38>)
> compares the algorithm on a vector(30'000) of array<int,16>. Its reported
> timings on my MacBook just now are 620µs for unstable_remove_if, 650µs for
> stable remove_if.
>
On my Intel CPU with Clang and libstdc++ the numbers are a bit better for
unstable, i.e. the speedup is larger, but it is a similar story:
UnstableRemoveIf/iterations:20000 243797 ns 243755 ns 20000
StdPartition/iterations:20000 290692 ns 290617 ns 20000
StdRemoveIf/iterations:20000 314727 ns 314668 ns 20000
The numbers themselves are not directly comparable since I prefer a fixed
number of iterations, but in general unstable is the fastest.
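For context, the back-fill idea behind unstable_remove_if is roughly the
following (a minimal sketch of the general technique; the SG14 implementation
linked above may differ in details): instead of shifting every kept element
left, we fill each "removed" slot with an element taken from the back, giving
up the order guarantee.

    #include <utility>

    // Fills removed slots with elements from the back; order is not preserved.
    template <class BidirIt, class Pred>
    BidirIt sketch_unstable_remove_if(BidirIt first, BidirIt last, Pred pred)
    {
        while (true) {
            // advance to the next element that should be removed
            while (first != last && !pred(*first))
                ++first;
            if (first == last)
                break;
            // retreat past trailing elements that should also be removed
            do {
                --last;
            } while (first != last && pred(*last));
            if (first == last)
                break;
            *first = std::move(*last); // overwrite the hole with a kept element
            ++first;
        }
        return first; // new logical end
    }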
But what I wanted to show is that if one is creative with the test setup (i.e.
I looked for the largest realistic setup where I expected unstable to shine),
larger speedups can be observed.
I picked strings of maximum SSO size to make moves expensive, and then things
look quite good for unstable:
unstable_remove_if/iterations:100000        92.3 ns        92.3 ns       100000
std_partition/iterations:100000              106 ns         106 ns       100000
std_remove_if/iterations:100000              142 ns         142 ns       100000
When we make the strings allocate, the speedups are, as expected, not that
impressive, because the advantage over partition is smaller. Without profiling
this seems expected: the advantage over partition is that we do not need to
keep the "removed" item intact the way partition does, and that makes the
number of operations much smaller when we are operating in the SSO buffer.
When we are just rewiring heap pointers, the difference between a move and a
swap is probably not that large.
unstable_remove_if/iterations:100000        95.5 ns        95.5 ns       100000
std_partition/iterations:100000              100 ns         100 ns       100000
std_remove_if/iterations:100000              135 ns         135 ns       100000
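To spell out the move-versus-swap point above: partition keeps the "removed"
element alive by swapping, which is roughly three moves, and for an SSO string
every move copies the whole inline buffer; unstable only needs a single move
into the hole. For a heap-allocated string both variants mostly just exchange
pointers, so the gap shrinks. Roughly:

    #include <string>
    #include <utility>

    void partition_style(std::string& kept, std::string& removed)
    {
        // keeps the removed value intact: ~3 moves, each copying the SSO buffer
        std::swap(kept, removed);
    }

    void unstable_style(std::string& hole, std::string& from_back)
    {
        // one move into the hole; the source is left moved-from, which is fine
        // because it ends up past the new logical end anyway
        hole = std::move(from_back);
    }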
What is interesting is that when we add some fraction of heap allocations to
the test data, remove_if seems to suffer a lot. I presumed this is because of
branch prediction, and basic checks with perf indeed show differences in
branch mispredictions, but it is unclear whether that is the cause of the
performance differences.
unstable_remove_if/iterations:100000        96.6 ns        96.6 ns       100000
std_partition/iterations:100000              110 ns         110 ns       100000
std_remove_if/iterations:100000              172 ns         172 ns       100000
All these numbers should be taken with a grain of salt; I did not test with
libc++ or inspect the asm.
If somebody wants to try it on their machine, feel free to just download the
attachment and try it with the different policies for the test data (3 are
supported and match the 3 results pasted above).
enum class Alloc
{
    No,
    Yes,
    Rare // to confuse the branch predictor we mostly allocate inside the SSO range, but not always
};
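For illustration, a hypothetical generator along these lines (the function
name, the sizes, and the assumption that 15 chars fits the SSO buffer on
typical 64-bit implementations are mine, not necessarily what the attachment
does):

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    std::vector<std::string> make_test_data(std::size_t n, Alloc policy)
    {
        std::mt19937 rng{42};
        std::uniform_int_distribution<int> coin{0, 9};
        std::vector<std::string> out;
        out.reserve(n);
        for (std::size_t i = 0; i != n; ++i) {
            const bool allocate =
                policy == Alloc::Yes ||
                (policy == Alloc::Rare && coin(rng) == 0); // ~10% heap allocations
            // 15 chars typically stays in the SSO buffer; 64 forces an allocation
            out.emplace_back(allocate ? 64 : 15, 'x');
            out.back().back() = static_cast<char>('a' + i % 26); // vary the data
        }
        return out;
    }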
You will need the SG14 repo linked above; place it in the benchmark folder and
edit CMakeLists.txt.
As for Arthur's suggestion of using something like shift_left, where we shift
in batches (if that is what is suggested): I am not aware of any benchmarks
for this.
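If the suggestion is what I think it is, a rough sketch (my interpretation
only, not Arthur's code) would move each contiguous run of kept elements as
one std::move call, which libraries can turn into a memmove for trivially
copyable types:

    #include <algorithm>

    template <class FwdIt, class Pred>
    FwdIt batched_remove_if(FwdIt first, FwdIt last, Pred pred)
    {
        first = std::find_if(first, last, pred);                    // first removed element
        FwdIt out = first;
        while (first != last) {
            FwdIt run_begin = std::find_if_not(first, last, pred);  // start of next kept run
            FwdIt run_end = std::find_if(run_begin, last, pred);    // end of that run
            out = std::move(run_begin, run_end, out);               // shift the whole run left
            first = run_end;
        }
        return out;
    }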
Received on 2025-02-10 00:57:36