Date: Mon, 14 Aug 2023 13:28:38 -0400
On Mon, Aug 14, 2023 at 12:25 PM Shumpei Shiina via Std-Discussion
<std-discussion_at_[hidden]> wrote:
>
> Hi,
>
> The standard says `init` should be accumulated only once for parallel STL (e.g., reduce(), transform_reduce(), exclusive_scan()), but this is contradicting the concept of "monoid", which is often used for functional programming for parallel patterns.
>
> std::reduce():
> ```
> template<class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
> T reduce(ExecutionPolicy&& exec,
> ForwardIterator first, ForwardIterator last, T init,
> BinaryOperation binary_op);
> ```
>
> In the above std::reduce() example, the pair of `binary_op` and `init` does not constitute a (commutative) monoid, because `init` may not be an "identity" element.
>
> An identity element satisfies the following:
> binary_op(identity, x) = x
> binary_op(x, identity) = x
>
> This means that the final result does not change regardless of how many times the identity element is accumulated.
> This property is beneficial for parallel implementations.
>
> With an identity element, std::reduce() can be straightforwardly parallelized for each thread:
> ```
> // [first, last) is a subrange given to this thread
> T acc = identity;
> for (auto it = first; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```
>
> However, in reality, STL does not have an identity element, and `init` should be accumulated only once.
> Thus, the code will look like:
> ```
> T acc = *first;
> for (auto it = first + 1; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```
>
> However, according to the following statement,
>
> > All of binary_op(init, *first), binary_op(*first, init), binary_op(init, init), and binary_op(*first, *first) shall be convertible to T.
>
> the type of each value (*first) is not necessarily T.
>
> Thus, initialization of acc for each thread (`T acc = *first;`) may not be correct.
> It should be changed to `T acc = binary_op(*first, *(first+1))`, but what if each thread is assigned only one element?
>
> In fact, an existing STL implementation assumes `*first` is of type T: https://github.com/KhronosGroup/SyclParallelSTL/blob/b2cdb3580cc581172ed4de133b05d112afde5b50/include/sycl/algorithm/buffer_algorithms.hpp#L206
>
> Therefore, I think the standard may want to consider this issue either:
> - by assuming `*first` is of type T for parallel execution, or
> - by taking an argument for an identity element to construct a monoid.
There are many overloads of these functions. There's a version of
`reduce` which does not take an `init`. This one already
assumes/declares that the range's value type is the return type, and
it will be the type of the implied `init` object. So the cited version
is a legitimate implementation of those overloads.
Just not of the other ones.
Also, wouldn't it be possible to accumulate all of the parallel stuff
first and then add `init` to the result at the end in series?
<std-discussion_at_[hidden]> wrote:
>
> Hi,
>
> The standard says `init` should be accumulated only once for parallel STL (e.g., reduce(), transform_reduce(), exclusive_scan()), but this is contradicting the concept of "monoid", which is often used for functional programming for parallel patterns.
>
> std::reduce():
> ```
> template<class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
> T reduce(ExecutionPolicy&& exec,
> ForwardIterator first, ForwardIterator last, T init,
> BinaryOperation binary_op);
> ```
>
> In the above std::reduce() example, the pair of `binary_op` and `init` does not constitute a (commutative) monoid, because `init` may not be an "identity" element.
>
> An identity element satisfies the following:
> binary_op(identity, x) = x
> binary_op(x, identity) = x
>
> This means that the final result does not change regardless of how many times the identity element is accumulated.
> This property is beneficial for parallel implementations.
>
> With an identity element, std::reduce() can be straightforwardly parallelized for each thread:
> ```
> // [first, last) is a subrange given to this thread
> T acc = identity;
> for (auto it = first; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```
>
> However, in reality, STL does not have an identity element, and `init` should be accumulated only once.
> Thus, the code will look like:
> ```
> T acc = *first;
> for (auto it = first + 1; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```
>
> However, according to the following statement,
>
> > All of binary_op(init, *first), binary_op(*first, init), binary_op(init, init), and binary_op(*first, *first) shall be convertible to T.
>
> the type of each value (*first) is not necessarily T.
>
> Thus, initialization of acc for each thread (`T acc = *first;`) may not be correct.
> It should be changed to `T acc = binary_op(*first, *(first+1))`, but what if each thread is assigned only one element?
>
> In fact, an existing STL implementation assumes `*first` is of type T: https://github.com/KhronosGroup/SyclParallelSTL/blob/b2cdb3580cc581172ed4de133b05d112afde5b50/include/sycl/algorithm/buffer_algorithms.hpp#L206
>
> Therefore, I think the standard may want to consider this issue either:
> - by assuming `*first` is of type T for parallel execution, or
> - by taking an argument for an identity element to construct a monoid.
There are many overloads of these functions. There's a version of
`reduce` which does not take an `init`. This one already
assumes/declares that the range's value type is the return type, and
it will be the type of the implied `init` object. So the cited version
is a legitimate implementation of those overloads.
Just not of the other ones.
Also, wouldn't it be possible to accumulate all of the parallel stuff
first and then add `init` to the result at the end in series?
Received on 2023-08-14 17:28:50