Date: Wed, 8 Nov 2023 16:06:29 -0500
Hi,
I've been playing around a lot with the awesome std::experimental::simd
library in GCC >11, but it's a bit error-prone to iterate over a container
by SIMD chunks:
using intv = std::experimental::fixed_size_simd<int, 4>;

std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

// Am I sliding over by the right amount? Going past the end?
for (size_t i = 0U; i < v.size(); i += intv::size())
{
    intv block{&v[i], std::experimental::element_aligned}; // am I accessing out of bounds?
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, ?
This example code reads past the end of the vector with that loop
condition. However, if there were a simd_iterator defined, you could do
this more safely:
std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

for (auto blockIter = simd_begin<intv>(v, -1); blockIter < simd_end<intv>(v); ++blockIter)
{
    intv block = *blockIter;
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, -1
This iterator would just wrap the logic from before, keeping track of the
missing remainder (similar to how stride_view works), and then for the
last SIMD block it could do a masked load from only the valid memory and
fill the uninitialized padding lanes with a user-supplied default.
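Here's a minimal sketch of what that last-block load could look like,
assuming the where/copy_from masked-load interface from the Parallelism
TS v2 (load_tail is a hypothetical name of mine, not proposed wording):

    namespace stdx = std::experimental;
    using intv = stdx::fixed_size_simd<int, 4>;

    // Hypothetical helper: load the final partial block of `rem` valid
    // elements through a mask, so only in-bounds memory is read, and fill
    // the leftover lanes with a caller-supplied default.
    intv load_tail(const int* mem, std::size_t rem, int fill)
    {
        intv block = fill;  // every lane starts as the padding value
        intv lane_ids([](auto lane) { return static_cast<int>(lane); });  // 0, 1, 2, 3
        auto valid = lane_ids < static_cast<int>(rem);  // mask: lane index in bounds
        stdx::where(valid, block).copy_from(mem, stdx::element_aligned);  // masked load
        return block;
    }

operator++ on the iterator would then just be i += intv::size(), clamped
at the end of the range like the raw loop above.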
If we have simd_begin and simd_end defined, it's only natural to also have
a views::by_simd adaptor:
std::vector<int> v{1, 2, 3, 4, 5, 6, 7};

for (auto block : v | views::by_simd<intv>(-1))
{
    std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
}
// prints:
// 1, 2, 3, 4
// 5, 6, 7, -1
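(Not proposed wording, but for exposition: something with this shape can
be emulated today with C++23 ranges, reusing the hypothetical load_tail
helper from above. It needs views::chunk, so GCC 13 or later:)

    #include <ranges>
    #include <span>

    // Rough emulation: chunk the range by the SIMD width, then turn each
    // chunk into a padded SIMD block. A real adaptor would use an unmasked
    // copy_from for the full blocks instead of always going through the mask.
    auto by_simd_emulated(std::span<const int> s, int fill)
    {
        return s | std::views::chunk(intv::size())
                 | std::views::transform([fill](auto chunk) {
                       return load_tail(std::ranges::data(chunk),
                                        std::ranges::size(chunk), fill);
                   });
    }

    // for (auto block : by_simd_emulated(v, -1)) { ... }  // same output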
From what I've seen, this "SIMD-chunk over a contiguous range" pattern is a
very common use case for SIMD.
Is there interest in this idea?
As a side effect, this enables some cool usage with standard algorithms:
template<typename T>
T simd_inner_product(std::span<T> a, std::span<T> b)
{
    using S = std::experimental::native_simd<T>;
    // Padding the tails with 0 is safe here: zero lanes add nothing to the sum.
    return std::experimental::reduce(
        std::inner_product(simd_begin<S>(a, 0), simd_end<S>(a), simd_begin<S>(b, 0), S{}));
}
On my system (tigerlake, gcc10, -mavx2 -mfma, T = float) this is about 3x
faster than plain std::inner_product (though I could have done something
wrong that prevented GCC from autovectorizing the scalar version).
(If this is not the right forum to share this idea, I apologize 😅. Please
let me know where to direct my proposal)
Thanks for reading,
-- Tor Shepherd
Received on 2023-11-08 21:06:41