C++ Logo

std-discussion

Advanced search

Re: Parallel STL std::reduce() and monoid

From: Jason McKesson <jmckesson_at_[hidden]>
Date: Mon, 14 Aug 2023 16:03:08 -0400
On Mon, Aug 14, 2023 at 3:05 PM Shumpei Shiina <s417.lama_at_[hidden]> wrote:
>
> > There are many overloads of these functions. There's a version of
> > `reduce` which does not take an `init`. This one already
> > assumes/declares that the range's value type is the return type, and
> > it will be the type of the implied `init` object. So the cited version
> > is a legitimate implementation of those overloads.
>
> > Just not of the other ones.
>
> I understand there are other overloads, and in most use cases, the type of `*first` will be T.
> However, as long as the specification says `*first` may not be of type T, implementations should cover the cases.
>
> > Also, wouldn't it be possible to accumulate all of the parallel stuff
> > first and then add `init` to the result at the end in series?
>
> Even if we do so, we cannot assume `*first` is of type T, which leads to the same issue.
>
> As mentioned in the original post, this will result in the following code:
> ```
> T acc = binary_op(*first, *(first+1));
> for (auto it = first + 2; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```

Or you can implement it correctly:

```
auto acc = binary_op(*first, *(first+1));
for (auto it = first + 2; it != last; ++it) {
  acc = binary_op(acc, *it);
}
// combine `acc` with other threads
```

The compiler knows the type; you don't have to specify it.

> This code assumes at least two elements are processed in each thread, but this can limit potential parallelism in the algorithm.
> This may not be a big problem in std::reduce(), but in more complicated map-reduce patterns (std::reduce_transform), the user-defined transform operator can take a reasonable amount of time.
>
> Suppose that we want to apply std::transform_reduce() to a vector {a1, a2, a3, a4} to compute init + f(a1) + f(a2) + f(a3) + f(a4), where f() is the user-defined transform operator and `+` is the commutative binary operator.
> Since `f()` can take a long time, it is expected for these computations to be load balanced to 4 threads.
> However, if we cannot assume `*first` is of type T (or cannot use a monoid), each thread will require at least two elements, resulting in load imbalance.

Um... why not? The type of `f(*it)` must be the same for each
invocation. Whether it is `T` or some other type `U` is irrelevant;
what matters is that you know the type. Similarly, the result of
`binary_op(f(*it), f(*it2))` is of the same type, for any particular
iterator. It need not be `T` or `U`, but it is again *known*. So
again, what is the problem?

The bug you cited is a bug that could have been fixed by using
deduction rather than assumption.

Received on 2023-08-14 20:03:20