Date: Mon, 14 Aug 2023 16:03:08 -0400

On Mon, Aug 14, 2023 at 3:05 PM Shumpei Shiina <s417.lama_at_[hidden]> wrote:

>

> > There are many overloads of these functions. There's a version of

> > `reduce` which does not take an `init`. This one already

> > assumes/declares that the range's value type is the return type, and

> > it will be the type of the implied `init` object. So the cited version

> > is a legitimate implementation of those overloads.

>

> > Just not of the other ones.

>

> I understand there are other overloads, and in most use cases, the type of `*first` will be T.

> However, as long as the specification allows `*first` to be of a type other than T, implementations should cover those cases.

>

> > Also, wouldn't it be possible to accumulate all of the parallel stuff

> > first and then add `init` to the result at the end in series?

>

> Even if we do so, we cannot assume `*first` is of type T, which leads to the same issue.

>

> As mentioned in the original post, this will result in the following code:

> ```

> T acc = binary_op(*first, *(first+1));

> for (auto it = first + 2; it != last; ++it) {

>     acc = binary_op(acc, *it);

> }

> // combine `acc` with other threads

> ```

Or you can implement it correctly:

```

auto acc = binary_op(*first, *(first+1));

for (auto it = first + 2; it != last; ++it) {

    acc = binary_op(acc, *it);

}

// combine `acc` with other threads

```

The compiler knows the type; you don't have to specify it.

> This code assumes at least two elements are processed in each thread, but this can limit potential parallelism in the algorithm.

> This may not be a big problem in std::reduce(), but in more complicated map-reduce patterns (std::transform_reduce), the user-defined transform operator can take a considerable amount of time.

>

> Suppose that we want to apply std::transform_reduce() to a vector {a1, a2, a3, a4} to compute init + f(a1) + f(a2) + f(a3) + f(a4), where f() is the user-defined transform operator and `+` is the commutative binary operator.

> Since `f()` can take a long time, these computations are expected to be load balanced across 4 threads.

> However, if we cannot assume `*first` is of type T (or cannot use a monoid), each thread will require at least two elements, resulting in load imbalance.

Um... why not? The type of `f(*it)` must be the same for each

invocation. Whether it is `T` or some other type `U` is irrelevant;

what matters is that you know the type. Similarly, the result of

`binary_op(f(*it), f(*it2))` is of the same type for any pair of

iterators. It need not be `T` or `U`, but it is again *known*. So

again, what is the problem?

The bug you cited is a bug that could have been fixed by using

deduction rather than assumption.

Received on 2023-08-14 20:03:20