Date: Wed, 8 Nov 2023 16:06:29 -0500
Hi,
I've been playing around a lot with the awesome std::experimental::simd
library in GCC >11, but it's a bit error-prone to iterate over a container
by SIMD chunks:
using intv = std::experimental::fixed_size_simd<int, 4>;

std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

// Am I sliding over by the right amount? Going past the end?
for (size_t i = 0U; i < v.size(); i += intv::size())
{
    intv block{&v[i], std::experimental::element_aligned}; // am I accessing out of bounds?
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, ?
This example code reads past the end of the vector with that loop
condition. However, if there were a simd_iterator defined, you could do
this more safely:
std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

for (auto blockIter = simd_begin<intv>(v, -1); blockIter < simd_end<intv>(v); ++blockIter)
{
    intv block = *blockIter;
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, -1
This iterator would just wrap the logic from before, keeping track of the
missing remainder (similar to how stride_view works), and then for the
last SIMD block it could do a masked load from only the valid memory and
fill the uninitialized padding lanes with a user-supplied default.
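Here's a minimal sketch of what that last-block load could look like,
assuming the where/copy_from masked-load interface from the Parallelism
TS v2 (load_tail is a hypothetical name of mine, not proposed wording):

    namespace stdx = std::experimental;
    using intv = stdx::fixed_size_simd<int, 4>;

    // Hypothetical helper: load the final partial block of `rem` valid
    // elements through a mask, so only in-bounds memory is read, and fill
    // the leftover lanes with a caller-supplied default.
    intv load_tail(const int* mem, std::size_t rem, int fill)
    {
        intv block = fill;  // every lane starts as the padding value
        intv lane_ids([](auto lane) { return static_cast<int>(lane); });  // 0, 1, 2, 3
        auto valid = lane_ids < static_cast<int>(rem);  // mask: lane index in bounds
        stdx::where(valid, block).copy_from(mem, stdx::element_aligned);  // masked load
        return block;
    }

operator++ on the iterator would then just be i += intv::size(), clamped
at the end of the range like the raw loop above.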
If we have simd_begin and simd_end defined, it's only natural to also have
a views::by_simd adaptor:
std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

for (auto block : v | views::by_simd<intv>(-1))
{
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, -1
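(Not proposed wording, but for exposition: something with this shape can
be emulated today with C++23 ranges, reusing the hypothetical load_tail
helper from above. It needs views::chunk, so GCC 13 or later:)

    #include <ranges>
    #include <span>

    // Rough emulation: chunk the range by the SIMD width, then turn each
    // chunk into a padded SIMD block. A real adaptor would use an unmasked
    // copy_from for the full blocks instead of always going through the mask.
    auto by_simd_emulated(std::span<const int> s, int fill)
    {
        return s | std::views::chunk(intv::size())
                 | std::views::transform([fill](auto chunk) {
                       return load_tail(std::ranges::data(chunk),
                                        std::ranges::size(chunk), fill);
                   });
    }

    // for (auto block : by_simd_emulated(v, -1)) { ... }  // same output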
From what I've seen, this "SIMD-chunk over a contiguous range" pattern is a
very common use case for SIMD.
Is there interest in this idea?
As a side effect, this enables some cool usage with standard algorithms:
template<typename T>
T simd_inner_product(std::span<T> a, std::span<T> b)
{
    using S = std::experimental::native_simd<T>;
    // Padding the tails with 0 is safe here: zero lanes add nothing to the sum.
    return std::experimental::reduce(
        std::inner_product(simd_begin<S>(a, 0), simd_end<S>(a), simd_begin<S>(b, 0), S{}));
}
On my system (tigerlake, gcc10, -mavx2 -mfma, T = float) this is about 3x
faster than plain std::inner_product (though I could have done something
wrong that prevented GCC from autovectorizing the scalar version).
(If this is not the right forum to share this idea, I apologize 😅. Please
let me know where to direct my proposal)
Thanks for reading,
-- Tor Shepherd
Received on 2023-11-08 21:06:41