Re: [std-proposals] Gap in the Standard Library? Single-pass fused reductions (similar to Clojure transducers)

From: Máté Ték <eppenpontaz_at_[hidden]>
Date: Mon, 27 Apr 2026 13:23:03 +0200
Did some benchmarking and got some interesting results.
quick-bench.com was acting up and refused to work, so I had to do it by
hand.
I hope I didn't mess it up. I encourage you to verify the code and fix any
issues.

> It's not a great example. In your transformation below, you mention
> find_if, but that's not exactly what you have here. You don't quit
> early when you have found an isMyFavorite element, so you're finding the
> last element matching the predicate in the range. That's not what find_if does,
> and it may be a lot slower.

I don't think that's true.
if (!myFavorite && isMyFavorite(i)) myFavorite = &i;
myFavorite is assigned only once, upon the first detected 'my favorite'
integer.
We don't quit the loop early because we still need to perform the min
search, but thanks to the !myFavorite guard we stop looking for my favorite
once it is found.
Therefore this should be equivalent to std::find_if.

Now the benchmark.
I created std::vector<int>s of various lengths (100, 250, 500, 1k, 2k,
5k, 10k, 1M, 10M, 100M).
Each contains random integers (but never std::numeric_limits<int>::max()),
and each contains my favorite integer exactly once, in the middle.
I then benchmarked the code shown earlier with two reductions: finding my
favorite int and finding the smallest even number.
I compared the hand-written fused approach, the hand-written separate
approach, and a std::ranges-based separate approach.
I tried to make sure the small-vector benchmarks are not polluted by the
std::chrono calls: the measured code runs in a loop for at least 2 seconds,
and the reported durations are the elapsed time divided by the number of
completed repetitions.
The test cases run in random order.
Compiled with Visual Studio 2026, C++23, /O2 optimizations.
Run on an HP ZBook with an i7-13850HX.

Size       Separate      Ranges        Fused
vec_100    339.355 ns    424.226 ns    311.085 ns
vec_250    820.78 ns     1.2445 us     675.184 ns
vec_500    1.73693 us    2.45033 us    1.36191 us
vec_1k     4.02432 us    4.74799 us    2.82963 us
vec_2k     13.5118 us    11.0527 us    5.96424 us
vec_5k     45.5524 us    38.922 us     26.3804 us
vec_10k    95.6572 us    100.218 us    74.6032 us
vec_1M     10.9401 ms    11.72 ms      10.8347 ms
vec_10M    111.501 ms    124.79 ms     110.26 ms
vec_100M   1.11905 s     1.21792 s     1.06251 s

The fused approach seems to be the fastest most of the time, and it is
never worse than the separate approach.
Also, the std::ranges approach is almost always the slowest for some
reason. Maybe due to bounds checking? Maybe I am misusing it? Feel free
to fix it.

I wanted to see what happens if we add a third reduction: checking if any
of the integers are equal to std::numeric_limits<int>::max().
(They won't be.)
Similar story.

Size       Separate      Ranges        Fused
vec_100    436.476 ns    632.794 ns    349.672 ns
vec_250    1.1785 us     1.57164 us    868.757 ns
vec_500    2.3378 us     2.97359 us    1.70301 us
vec_1k     5.02519 us    6.71979 us    3.67588 us
vec_2k     11.6636 us    12.801 us     7.43435 us
vec_5k     40.2481 us    44.0867 us    28.1171 us
vec_10k    104.112 us    115.564 us    78.0701 us
vec_1M     11.7792 ms    13.4072 ms    11.3306 ms
vec_10M    121.84 ms     143.856 ms    113.612 ms
vec_100M   1.19948 s     1.43543 s     1.12957 s

I also wanted to see what happens if we use
std::vector<std::unique_ptr<int>>'s to simulate a different scenario.
Maybe we have a collection of more complex objects in a vector, and we are
accessing a dynamically allocated member of these objects.
(Though in these tests all vectors are adapted with std::views::transform
so I can use the same code as before, which expects a range of integers and
not pointers.)
First with 2 reductions.
Interestingly, the separate approach is better for smaller vectors (up to
roughly 1k elements), while the fused approach is better for larger ones.

Size        Separate      Ranges        Fused
ptrs_100    313.278 ns    426.052 ns    358.013 ns
ptrs_250    836.917 ns    1.38199 us    857.285 ns
ptrs_500    1.62045 us    2.44668 us    1.73367 us
ptrs_1k     3.45427 us    5.71435 us    3.41132 us
ptrs_2k     10.2046 us    11.2554 us    6.69008 us
ptrs_5k     45.4845 us    49.0578 us    36.3655 us
ptrs_10k    117.271 us    134.156 us    100.999 us
ptrs_1M     38.6883 ms    40.4017 ms    30.7326 ms
ptrs_10M    808.657 ms    826.348 ms    594.558 ms
ptrs_100M   12.0049 s     12.255 s      8.05112 s

With 3 reductions, the fused approach is always better, with a 2x speedup
with large vectors:

Size        Separate      Ranges        Fused
ptrs_100    391.52 ns     742.995 ns    292.496 ns
ptrs_250    1.07791 us    1.8128 us     709.377 ns
ptrs_500    2.03489 us    3.84051 us    1.41061 us
ptrs_1k     4.33632 us    7.22235 us    2.90761 us
ptrs_2k     8.96157 us    17.4583 us    6.75004 us
ptrs_5k     45.0996 us    61.18 us      34.5204 us
ptrs_10k    122.581 us    158.637 us    96.4118 us
ptrs_1M     55.6052 ms    57.6549 ms    30.4041 ms
ptrs_10M    1.29215 s     1.30666 s     595.322 ms
ptrs_100M   19.1692 s     20.2597 s     8.36999 s

I never noticed any speedups or slowdowns that could be attributed to the
relative order of the test cases; the results seem very stable and reproducible.
Here's the full code (a single main.cpp):

#include <random>
#include <vector>
#include <algorithm>
#include <limits>
#include <chrono>
#include <tuple>
#include <ranges>
#include <iostream>
#include <string_view>
#include <format>
#include <functional>
#include <array>    // std::array
#include <cstdlib>  // std::abort
#include <memory>   // std::make_unique

// By default there are two reductions: find my favorite int and find the minimum of even ints.
// This flag enables a third reduction: check if any int is equal to std::numeric_limits<int>::max().
// (Note the trailing underscore: the flag is disabled as written; remove the underscore to enable it.)
#define ENABLE_ANY_OF_MAX_INT_

#ifdef ENABLE_ANY_OF_MAX_INT
#define IF_ANY_OF_MAX_INT(...) __VA_ARGS__
#else
#define IF_ANY_OF_MAX_INT(...)
#endif

// Attach debugger (remove the trailing underscore to enable)
#define WAIT_FOR_INPUT_

// 'vec' for std::vector<int>, 'ptrs' for std::vector<std::unique_ptr<int>> with deref view
#define BENCH_TYPE vec

auto maxDetector = [](int i) noexcept -> bool { return std::numeric_limits<int>::max() == i; };

volatile int myFavoriteInt = 7979;

auto favoriteDetector = [](int i) noexcept -> bool { return myFavoriteInt == i; };

bool IsEven(int i) noexcept { return i % 2 == 0; }

std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution distrib;

// Returns a random int that is neither my favorite nor int max:
int Random()
{
    do {
        int i = distrib(gen);
        if (!favoriteDetector(i) && !maxDetector(i)) return i;
    } while (true);
}

// Returned vector:
// - contains my favorite int exactly once in the middle,
// - does not contain int max:
template<std::size_t Sz> requires (Sz >= 10)
auto MakeRandomVec()
{
    std::vector<int> init(Sz);
    std::generate_n(init.begin(), Sz, Random);
    init[Sz / 2] = myFavoriteInt;
    return init;
}

using namespace std::chrono;

std::string FormatPrettyDuration(const auto dur)
{
    std::string output;
    auto tryPeriod = [&]<class dur_t>(dur_t&&)
    {
        const auto durFloat = duration_cast<duration<double, typename dur_t::period>>(dur);
        if (1. <= durFloat.count() && durFloat.count() < 1000.)
            output = std::vformat("{0:7.3%Q} {0:%q}", std::make_format_args(durFloat));
    };
    tryPeriod(nanoseconds{});
    tryPeriod(microseconds{});
    tryPeriod(milliseconds{});
    tryPeriod(seconds{});
    return output;
}

using perf_clock = high_resolution_clock; // Probably same as steady_clock, but whatever

void RunBenchAndReport(const auto method,
                       const auto& vals,
                       const perf_clock::duration targetDuration,
                       const std::size_t batchSz,
                       const std::string_view context) noexcept
{
    // Measurement stuff
    const auto start = perf_clock::now();
    perf_clock::duration elapsed;
    std::size_t repeatCount = 0;

    // Avoid optimizations
    const int* volatile myFave = nullptr;
    const int* volatile minEven = nullptr;
    volatile int acc1{ 0 }, acc2{ 0 };
    IF_ANY_OF_MAX_INT(volatile bool anyIntMax = false);

    // Run benchmark for at least 'targetDuration' and count the number of repeats
    do {
        for (auto i = batchSz; i > 0; --i) {
            std::tie(myFave, minEven IF_ANY_OF_MAX_INT(, anyIntMax)) = method(vals);

            // Should never be hit, just avoid optimization
            if (!myFave || !minEven IF_ANY_OF_MAX_INT(|| anyIntMax))
                std::abort();

            if (i % 2u == 0u) {
                acc1 += *myFave;
                acc2 += *minEven;
            }
            else {
                acc1 -= *myFave;
                acc2 -= *minEven;
            }
        }
        repeatCount += batchSz;
        elapsed = perf_clock::now() - start;
    } while (elapsed < targetDuration);

    const auto durationPerRepeat = duration<double, perf_clock::duration::period>{
        elapsed.count() / static_cast<double>(repeatCount)
    };
    std::cout << std::vformat("{0: <20}", std::make_format_args(context))
        << FormatPrettyDuration(durationPerRepeat)
        << " (Dummies: " << acc1 << " " << acc2 IF_ANY_OF_MAX_INT(<< " " << anyIntMax) << ")"
        << '\n';
}

auto testFused = [](const auto& vals) {
    const int* myFavorite = nullptr;
    const int* minEven = nullptr;
    IF_ANY_OF_MAX_INT(bool anyIntMax = false);
    for (const int& i : vals) {
        if (!myFavorite && favoriteDetector(i)) myFavorite = &i;
        if (IsEven(i) && (!minEven || i < *minEven)) minEven = &i;
        IF_ANY_OF_MAX_INT(if (!anyIntMax) anyIntMax = maxDetector(i));
    }
    return std::make_tuple(myFavorite, minEven IF_ANY_OF_MAX_INT(, anyIntMax));
};

auto testSeparate = [](const auto& vals) {
    const int* myFavorite = nullptr;
    for (const int& i : vals) {
        if (favoriteDetector(i)) {
            myFavorite = &i;
            break;
        }
    }
    const int* minEven = nullptr;
    for (const int& i : vals) {
        if (IsEven(i) && (!minEven || i < *minEven)) minEven = &i;
    }
    IF_ANY_OF_MAX_INT(bool anyIntMax = false);
    IF_ANY_OF_MAX_INT(for (const int& i : vals) { if (maxDetector(i)) { anyIntMax = true; break; } });
    return std::make_tuple(myFavorite, minEven IF_ANY_OF_MAX_INT(, anyIntMax));
};

int testRanges_myFavorite;
int testRanges_minEven;
auto testRanges = [](const auto& vals) {
    testRanges_myFavorite = (vals | std::views::filter(favoriteDetector)).front();
    testRanges_minEven = std::ranges::min(vals | std::views::filter(IsEven), std::less<>{});
    IF_ANY_OF_MAX_INT(bool anyIntMax = std::ranges::any_of(vals, maxDetector));
    return std::make_tuple(&testRanges_myFavorite, &testRanges_minEven IF_ANY_OF_MAX_INT(, anyIntMax));
};

const auto vec_100 = MakeRandomVec<100>();
const auto vec_250 = MakeRandomVec<250>();
const auto vec_500 = MakeRandomVec<500>();
const auto vec_1k = MakeRandomVec<1'000>();
const auto vec_2k = MakeRandomVec<2'000>();
const auto vec_5k = MakeRandomVec<5'000>();
const auto vec_10k = MakeRandomVec<10'000>();
const auto vec_1M = MakeRandomVec<1'000'000>();
const auto vec_10M = MakeRandomVec<10'000'000>();
const auto vec_100M = MakeRandomVec<100'000'000>();

constexpr auto from_normal_vec = [](const auto& vec) {
    auto make_unique = std::views::transform([](int i) { return std::make_unique<int>(i); });
    auto ret = std::vector(std::from_range, vec | make_unique);
    std::shuffle(ret.begin(), ret.end(), gen); // Just for good measure
    // Keep my favorite int in the middle
    std::swap(ret[ret.size() / 2], *std::ranges::find_if(ret, [](auto&& ptr) { return favoriteDetector(*ptr); }));
    return ret;
};

const auto storage_ptrs_100 = from_normal_vec(vec_100);
const auto storage_ptrs_250 = from_normal_vec(vec_250);
const auto storage_ptrs_500 = from_normal_vec(vec_500);
const auto storage_ptrs_1k = from_normal_vec(vec_1k);
const auto storage_ptrs_2k = from_normal_vec(vec_2k);
const auto storage_ptrs_5k = from_normal_vec(vec_5k);
const auto storage_ptrs_10k = from_normal_vec(vec_10k);
const auto storage_ptrs_1M = from_normal_vec(vec_1M);
const auto storage_ptrs_10M = from_normal_vec(vec_10M);
const auto storage_ptrs_100M = from_normal_vec(vec_100M);

constexpr auto deref = std::views::transform([](auto&& ptr) -> decltype(auto) { return *ptr; });

const auto ptrs_100 = storage_ptrs_100 | deref;
const auto ptrs_250 = storage_ptrs_250 | deref;
const auto ptrs_500 = storage_ptrs_500 | deref;
const auto ptrs_1k = storage_ptrs_1k | deref;
const auto ptrs_2k = storage_ptrs_2k | deref;
const auto ptrs_5k = storage_ptrs_5k | deref;
const auto ptrs_10k = storage_ptrs_10k | deref;
const auto ptrs_1M = storage_ptrs_1M | deref;
const auto ptrs_10M = storage_ptrs_10M | deref;
const auto ptrs_100M = storage_ptrs_100M | deref;


int main()
{
#ifdef WAIT_FOR_INPUT
    std::cout << "Enter a character to continue: ";
    char c;
    std::cin >> c;
#endif

    std::array<std::function<void()>, 3u> testCases;

// Two-step concatenation so BENCH_TYPE is expanded before token pasting
// (pasting directly with ## would try to glue ')' to '_' and is not portable):
#define CAT_(a, b) a ## b
#define CAT(a, b) CAT_(a, b)
#define STR(x) #x
#define COMPARE(coll, size) \
    std::cout << "\nBegin run...\n"; \
    testCases[0] = [&] { RunBenchAndReport(testSeparate, CAT(coll, CAT(_, size)), duration, batchSz, "Separate " STR(coll) "_" #size); }; \
    testCases[1] = [&] { RunBenchAndReport(testRanges, CAT(coll, CAT(_, size)), duration, batchSz, "Ranges " STR(coll) "_" #size); }; \
    testCases[2] = [&] { RunBenchAndReport(testFused, CAT(coll, CAT(_, size)), duration, batchSz, "Fused " STR(coll) "_" #size); }; \
    std::shuffle(testCases.begin(), testCases.end(), gen); \
    std::ranges::for_each(testCases, [](auto& f) { f(); })

    constexpr seconds duration(2);

    std::size_t batchSz;

bench_begin:
    std::cout << "\n\n ---------------- Test Run ----------------\n";
    batchSz = 1'000'000u;
    COMPARE(BENCH_TYPE, 100);
    COMPARE(BENCH_TYPE, 250);
    COMPARE(BENCH_TYPE, 500);

    batchSz = 100'000u;
    COMPARE(BENCH_TYPE, 1k);
    COMPARE(BENCH_TYPE, 2k);
    COMPARE(BENCH_TYPE, 5k);

    batchSz = 10'000u;
    COMPARE(BENCH_TYPE, 10k);

    batchSz = 100u;
    COMPARE(BENCH_TYPE, 1M);

    batchSz = 10u;
    COMPARE(BENCH_TYPE, 10M);

    batchSz = 2u;
    COMPARE(BENCH_TYPE, 100M);

    goto bench_begin; // Loops forever; kill the program manually when you've had enough
}

On Sun, 26 Apr 2026 at 08:07, Jens Maurer via Std-Proposals <
std-proposals_at_[hidden]> wrote:

>
>
> On 4/26/26 07:06, Jan Schultke via Std-Proposals wrote:
> > I don't think it's necessary to cover every possible combination of
> algorithms/reductions in one function. At some point, you will just need to
> use a custom loop, and that's fine.
> >
> > I was refactoring code at my workplace the other day, which looked
> something like this (simplified):
> >
> >
> >
> > std::vector<int> values = ... ;
> > auto isMyFavorite = [](int i) -> bool { ... };
> > int* myFavorite = nullptr;
> > int* minEven = nullptr;
> > for (int& i : values) {
> > if (!myFavorite && isMyFavorite(i))
> > myFavorite = &i;
> > if ((i%2==0) && (!minEven || i < *minEven))
> > minEven = &i;
> > }
> >
> >
> > It's not a great example. In your transformation below, you mention
> find_if, but that's not exactly what you have here. You don't quit
> early when you have found an isMyFavorite element, so you're finding the
> last element matching the predicate in the range. That's not what find_if
> does, and it may be a lot slower.
> >
> > You can already break this up into:
> >
> > * std::ranges::find_if(values, isMyFavorite)
> > * std::ranges::min_element(values, values | std::views::filter(even))
>
> It would be good to have an actual catalog of standard library functions
> that would be in scope for the new facility. I guess that interface
> surface would probably be quite large. The effort of standardizing this
> is likely going to be a multi-year effort, are you ready for that?
>
> > You're saying that you would like to do it in a single pass, but have
> you actually measured whether that would improve performance meaningfully
> in any scenario?
>
> I agree with the idea of benchmarking this single-pass idea (via a
> hand-written loop)
> against the traditional multi-pass approach.
>
> My guess is that, for carefully chosen algorithms, it does make a
> difference for
> large ranges (> 1G elements) where your throughput is limited by main
> memory bandwidth.
> For a lot of smaller situations, my guess is that the sequential memory
> access
> pattern will trigger the prefetcher of the CPU, making cache locality
> concerns
> almost moot.
>
> Jens
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>


-- 
Regards, Máté

Received on 2026-04-27 11:23:14