Date: Tue, 15 Aug 2023 06:06:02 +0900
Let me explain the problem step by step.
Suppose that only two threads exist for simplicity.
A parallel implementation of std::transform_reduce() will look like this:
```
template <class ExecutionPolicy,
          class ForwardIterator, class T,
          class BinaryOperation, class UnaryOperation>
T transform_reduce(ExecutionPolicy&& exec,
                   ForwardIterator first, ForwardIterator last,
                   T init,
                   BinaryOperation binary_op, UnaryOperation unary_op) {
  // Forward iterators do not support + or -, so use std::next/std::distance.
  ForwardIterator middle = std::next(first, std::distance(first, last) / 2);
  // Call the following two functions in parallel
  T acc1 = transform_reduce_impl(first, middle, binary_op, unary_op);
  T acc2 = transform_reduce_impl(middle, last, binary_op, unary_op);
  return binary_op(init, binary_op(acc1, acc2));
}
```
This simply divides the input range into two subranges and calls
`transform_reduce_impl()` (below) on each in parallel.
It expects `transform_reduce_impl()` to return a value of type T and
combines the two partial results with `binary_op` afterwards.
Let's define the types as follows:

  U :: return type of unary_op(*first)
  T :: type of `init` (also the return type of binary_op(U, U),
       binary_op(T, T), binary_op(U, T), and binary_op(T, U))
The problem happens in the implementation of `transform_reduce_impl()`.
The first case we can consider is:
```
template <class ForwardIterator,
          class BinaryOperation, class UnaryOperation>
auto transform_reduce_impl(ForwardIterator first, ForwardIterator last,
                           BinaryOperation binary_op, UnaryOperation unary_op) {
  auto acc = unary_op(*first); // type U
  for (auto it = std::next(first); it != last; ++it) {
    acc = binary_op(acc, unary_op(*it)); // binary_op(U, U) -> T; incorrect
  }
  return acc;
}
```
Here `acc` is declared with type U, but `binary_op()` returns type T, so
the assignment in the loop tries to store a T into a variable of type U.
When U != T this is ill-formed, or performs a silent (possibly lossy)
conversion; if U == T, this implementation is valid (the case I cited
before).
To define `acc` of type T from the beginning, we need at least two elements
for each thread:
```
template <class ForwardIterator,
          class BinaryOperation, class UnaryOperation>
auto transform_reduce_impl(ForwardIterator first, ForwardIterator last,
                           BinaryOperation binary_op, UnaryOperation unary_op) {
  auto acc = binary_op(unary_op(*first), unary_op(*std::next(first))); // type T
  for (auto it = std::next(first, 2); it != last; ++it) {
    acc = binary_op(acc, unary_op(*it)); // binary_op(T, U) -> T; correct
  }
  return acc;
}
```
This code is valid, but it can cause the load imbalance problem mentioned
before.
That is, for transform_reduce_impl() to return type T, [first, last) must
contain at least two elements.
Certainly, we can check that the number of elements is greater than one
before calling transform_reduce_impl(), but then those two elements are
processed serially, sacrificing the parallelism between them.
On Tue, Aug 15, 2023 at 5:03 Jason McKesson via Std-Discussion <
std-discussion_at_[hidden]> wrote:
> On Mon, Aug 14, 2023 at 3:05 PM Shumpei Shiina <s417.lama_at_[hidden]>
> wrote:
> >
> > > There are many overloads of these functions. There's a version of
> > > `reduce` which does not take an `init`. This one already
> > > assumes/declares that the range's value type is the return type, and
> > > it will be the type of the implied `init` object. So the cited version
> > > is a legitimate implementation of those overloads.
> >
> > > Just not of the other ones.
> >
> > I understand there are other overloads, and in most use cases, the type
> of `*first` will be T.
> > However, as long as the specification says `*first` may not be of type
> T, implementations should cover the cases.
> >
> > > Also, wouldn't it be possible to accumulate all of the parallel stuff
> > > first and then add `init` to the result at the end in series?
> >
> > Even if we do so, we cannot assume `*first` is of type T, which leads to
> the same issue.
> >
> > As mentioned in the original post, this will result in the following
> code:
> > ```
> > T acc = binary_op(*first, *(first+1));
> > for (auto it = first + 2; it != last; ++it) {
> > acc = binary_op(acc, *it);
> > }
> > // combine `acc` with other threads
> > ```
>
> Or you can implement it correctly:
>
> ```
> auto acc = binary_op(*first, *(first+1));
> for (auto it = first + 2; it != last; ++it) {
> acc = binary_op(acc, *it);
> }
> // combine `acc` with other threads
> ```
>
> The compiler knows the type; you don't have to specify it.
>
> > This code assumes at least two elements are processed in each thread,
> but this can limit potential parallelism in the algorithm.
> > This may not be a big problem in std::reduce(), but in more complicated
> map-reduce patterns (std::transform_reduce), the user-defined transform
> operator can take a reasonable amount of time.
> >
> > Suppose that we want to apply std::transform_reduce() to a vector {a1,
> a2, a3, a4} to compute init + f(a1) + f(a2) + f(a3) + f(a4), where f() is
> the user-defined transform operator and `+` is the commutative binary
> operator.
> > Since `f()` can take a long time, it is expected for these computations
> to be load balanced to 4 threads.
> > However, if we cannot assume `*first` is of type T (or cannot use a
> monoid), each thread will require at least two elements, resulting in load
> imbalance.
>
> Um... why not? The type of `f(*it)` must be the same for each
> invocation. Whether it is `T` or some other type `U` is irrelevant;
> what matters is that you know the type. Similarly, the result of
> `binary_op(f(*it), f(*it2))` is of the same type, for any particular
> iterator. It need not be `T` or `U`, but it is again *known*. So
> again, what is the problem?
>
> The bug you cited is a bug that could have been fixed by using
> deduction rather than assumption.
> --
> Std-Discussion mailing list
> Std-Discussion_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-discussion
>
Received on 2023-08-14 21:06:15