Re: [std-proposals] Simd iterator/view

From: Andreas Ringlstetter <andreas.ringlstetter_at_[hidden]>
Date: Wed, 6 Dec 2023 07:53:41 +0100
> it's a bit error prone to iterate over a container by SIMD chunks

What you are trying is similar to an older suggestion (never refined into a
full proposal) to extend the existing basic containers / their iterators by
something like a `std::pair<subrange<wrap_contiguous<iterator>>,
subrange<iterator>> chunk(iterator begin, iterator end,
 contiguous_memory_tag{})`, where the first returned chunk was guaranteed to
be contiguous memory regardless of whether the container itself is
contiguous in its entirety. That would be a lot easier to extend with
different tags, such as `mid_point_heuristic_tag`, to facilitate the
different partition strategies used by various algorithms. (This interface
was actually designed to permit perfect auto-vectorization and
parallelization on *any* container type or range, with a fallback of
breaking down to `std::next` for non-contiguous containers that have no
better heuristics, and ideal defaults if the backing container happens to
be contiguous.)

Or, in the case of the simd support: `simd_aligned_tag`, even though the
signature would rather extend to `std::tuple<subrange<iterator>,
 subrange<wrap_simd<iterator>>, subrange<iterator>> chunk(iterator begin,
iterator end, simd_aligned_tag{})`, given that it may not be possible to
find a suitable, contiguous block for simd which aligns directly to the
begin of the input range, necessitating isolating a head as well. This
first range can't contain any further SIMD-suitable sub-ranges, but the
third one can - except that there may be mis-aligned sub-ranges in between.

This API was supposed to be called iteratively and hierarchically with
different tags, to break down the head/tail all the way from
auto-parallelization over simple loop unrolling down to straight SIMD by
auto-vectorization. It also applies to manual vectorization, I suppose?

Usage example:

using T = std::deque<int>; // std::queue has no iterators; deque is a
                           // non-contiguous container that does
T foo;
std::ranges::subrange<T::iterator> tail(foo.begin(), foo.end());
do
{
    std::ranges::subrange<T::iterator> head;
    std::ranges::subrange<wrap_simd<T::iterator>> body;
    std::tie(head, body, tail) = chunk(tail.begin(), tail.end(),
                                       simd_aligned_tag{});
    for (const auto& scalar : head)
    {
        // Non-SIMD head
    }
    for (const auto& block : body)
    {
        // SIMD block
    }
} while (tail.begin() != tail.end());

The exact same pattern is also applicable with the other tags, except the
head should always end up empty.

> If we have a simd_begin and simd_end defined, it's only reasonable to
have a views::by_simd adaptor

Not a bad idea in principle, but as you've noticed, you still have the
issue that it copes neither with non-contiguous underlying containers, nor
with a potentially mis-aligned head (not every platform has efficient
unaligned loads) or an odd-sized tail.

That's not a good fit (in terms of ease of use) for a view - correctly
processing the skipped parts is suddenly hard.

If you were to pack this into a view, you would have to yield both the
simd chunks and the non-simd scalar values. That in turn makes it hard to
chain this view correctly, as you need a variant element type.
However, it should still be possible for the compiler to deduce - when
unrolling - that the scalar alternative can't occur in the middle of the
iteration for a contiguous base container, thereby eliminating the
otherwise necessary inner `std::visit`?
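
To make the problem concrete, here is a minimal sketch of such a
variant-yielding traversal, with `std::array` standing in for a simd type
and all names invented for illustration:

```cpp
#include <array>
#include <cstddef>
#include <variant>
#include <vector>

// Illustrative only: each yielded element is either a whole SIMD block
// or a leftover scalar, forcing consumers into divergent handling.
using block4  = std::array<int, 4>;
using element = std::variant<int, block4>;

std::vector<element> by_simd_elements(const std::vector<int>& v)
{
    std::vector<element> out;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) // whole, in-order SIMD blocks
        out.push_back(block4{v[i], v[i + 1], v[i + 2], v[i + 3]});
    for (; i < v.size(); ++i)         // stray scalars at the end
        out.push_back(v[i]);
    return out;
}
```

For a contiguous input, the scalar alternative can only appear at the very
end, which is the property an unrolling compiler would have to exploit.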

Giving it two complementary views, in order to catch both the "aligned"
and the "stray" chunks, is not a good fit either - it doesn't work with
ranges you can't restart.

This is kind of hitting a blind spot in the ranges library - occasionally
you NEED to have divergent processing before moving back down to a single
scalar path.


On Wed, Nov 8, 2023 at 22:06, Tor Shepherd via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Hi,
>
> I've been playing around a lot with the awesome std::experimental::simd
> library in GCC >11, but it's a bit error prone to iterate over a container
> by SIMD chunks:
>
> using intv = std::experimental::fixed_size_simd<int, 4>;
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> // am I sliding over by the right amount? Going past the end?
> for (size_t i = 0U; i < v.size(); i += intv::size())
> {
> intv block{&v[i], std::experimental::element_aligned}; // am I
> accessing out of bounds?
> std::print("{}, {}, {}, {}\n", block[0], block[1], block[2],
> block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, ?
>
> This example code would access past-the-end of the vector with that end
> condition. However, if there were a simd_iterator defined, you could do
> this more safely:
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> for (auto blockIter = simd_begin<intv>(v, -1); blockIter <
> simd_end<intv>(v); ++blockIter)
> {
> intv block = *blockIter;
> std::print("{}, {}, {}, {}\n", block[0], block[1], block[2],
> block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, -1
>
> This iterator would just wrap the logic from before, with keeping track of
> the missing remainder (similar to how strided_view works), and then for the
> last SIMD block, could do masked loading from only the valid memory and set
> the uninitialized padding values with a user default.
>
> If we have a simd_begin and simd_end defined, it's only reasonable to have
> a views::by_simd adaptor:
>
> std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
> for (auto block : v | views::by_simd<intv>(-1))
> {
> std::print("{}, {}, {}, {}\n", block[0], block[1], block[2],
> block[3]);
> }
> // prints:
> // 1, 2, 3, 4
> // 5, 6, 7, -1
>
> From what I've seen this "SIMD-chunk over contiguous range" is a very
> common use case for SIMD.
>
> Is there interest in this idea?
>
> As a side effect, this enables some cool usage with standard algorithms:
>
> template<typename T>
> T simd_inner_product(std::span<T> a, std::span<T> b) {
> using S = std::experimental::native_simd<T>;
> return std::experimental::reduce(
> std::inner_product(simd_begin<S>(a, 0), simd_end<S>(a),
> simd_begin<S>(b, 0), S{}));
> }
>
> On my system (tigerlake, gcc10, -mavx2 -mfma, T = float) this is about 3x
> faster than std::inner_product (though I could have done something wrong to
> prevent GCC from autovectorizing)
>
> (If this is not the right forum to share this idea, I apologize 😅. Please
> let me know where to direct my proposal)
>
> Thanks for reading,
>
> --
> Tor Shepherd
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>

Received on 2023-12-06 06:53:56