Date: Wed, 8 Nov 2023 23:08:59 +0200
On Wed, 8 Nov 2023 at 23:06, Tor Shepherd via Std-Proposals
<std-proposals_at_[hidden]> wrote:
>
> Hi,
>
> I've been playing around a lot with the awesome std::experimental::simd library in GCC >11, but iterating over a container in SIMD chunks is a bit error-prone:
>
> using intv = std::experimental::fixed_size_simd<int, 4>;
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> // am I sliding over by the right amount? Going past the end?
> for (size_t i = 0U; i < v.size(); i += intv::size())
> {
>     intv block{&v[i], std::experimental::element_aligned}; // am I accessing out of bounds?
>     std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, ?
>
> This example code would read past the end of the vector with that end condition. However, if a simd_iterator were defined, you could do this more safely:
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> for (auto blockIter = simd_begin<intv>(v, -1); blockIter < simd_end<intv>(v); ++blockIter)
> {
>     intv block = *blockIter;
>     std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, -1
>
> This iterator would simply wrap the logic above, keeping track of the remainder (similar to how strided_view works); for the final SIMD block it could do a masked load from only the valid memory and fill the uninitialized padding lanes with a user-supplied default.
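> The tail handling this iterator would encapsulate can be sketched in plain C++ (a minimal sketch, with a std::array standing in for the simd register and the function name purely illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch: chunk a contiguous range into blocks of width W,
// filling the padding lanes of the final partial block with a user default.
template <std::size_t W, typename T>
std::vector<std::array<T, W>> chunk_with_padding(const std::vector<T>& v, T fill)
{
    std::vector<std::array<T, W>> blocks;
    for (std::size_t i = 0; i < v.size(); i += W) {
        std::array<T, W> block;
        block.fill(fill);                        // pre-fill the padding lanes
        const std::size_t valid = std::min(W, v.size() - i);
        for (std::size_t j = 0; j < valid; ++j) // "masked load": copy only valid elements
            block[j] = v[i + j];
        blocks.push_back(block);
    }
    return blocks;
}
```

> With v = {1, 2, 3, 4, 5, 6, 7}, width 4, and fill -1, this produces {1, 2, 3, 4} and {5, 6, 7, -1}, matching the output shown above.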
>
> If simd_begin and simd_end are defined, it's only natural to also have a views::by_simd adaptor:
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> for (auto block : v | views::by_simd<intv>(-1))
> {
>     std::print("{}, {}, {}, {}\n", block[0], block[1], block[2], block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, -1
>
> From what I've seen, this "SIMD chunks over a contiguous range" pattern is a very common use case for SIMD.
>
> Is there interest in this idea?
>
> As a side effect, this enables some cool usage with standard algorithms:
>
> template<typename T>
> T simd_inner_product(std::span<T> a, std::span<T> b) {
>     using S = std::experimental::native_simd<T>;
>     return std::experimental::reduce(
>         std::inner_product(simd_begin<S>(a, 0), simd_end<S>(a), simd_begin<S>(b, 0), S{}));
> }
>
> On my system (Tiger Lake, GCC 10, -mavx2 -mfma, T = float) this is about 3x faster than std::inner_product (though I could have done something wrong that prevented GCC from autovectorizing).
>
> (If this is not the right forum to share this idea, I apologize 😅. Please let me know where to direct my proposal)
>
> Thanks for reading,
Please check https://isocpp.org/files/papers/D3024R0.html
and talk to its authors.
Received on 2023-11-08 21:09:12