std-proposals: Re: Balanced division of iterator range without LegacyRandomAccessIterator trait

From: Andreas Ringlstetter <andreas.ringlstetter_at_[hidden]>
Date: Sun, 6 Sep 2020 22:34:03 +0200

On 06.09.2020 19:52, Jason McKesson via Std-Proposals wrote:

> Given O(1) size information, you could apportion threads by having
> each thread advance their iterator forward by the number of threads
> involved. And of course their initial iterator has to be offset
> appropriately, and they have to check for the end iterator when
> incrementing their own. So while `slice` is probably not the best
> option, having O(1) size for ranges that provide it may be enough to
> build a good solution even for non-random access ranges.

It's actually the bad real life performance of this exact strategy
(stride with offset) which has prompted this thread.

Even for random-access iterators (where advancing by multiple steps has
O(1)), this results in a mix of occasional false-sharing (if the threads
remain in sync), and reloads of already evicted cache lines (if they
diverge). With a tendency to the later, so it effectively results in a
load amplification on memory throughput for sufficiently large
collections. Far from good resource utilization, to put it carefully,
slowdown instead of speedup in wall clock time is quite possible, even
if the time complexity is still O(n/p) in theory.

For iterators with a higher increment cost, from structures such as
trees or sparse structures, this additionally amplifies the cost for the
iteration itself by the number of participating threads. For pointer
heavy structures, and given the width of modern CPUs, this is scaling
horribly to the extent where not only the iteration cost outweighs any
reasonable payload by far, but it's on top also inevitably stalling on
loads as each thread is walking the pointer chain independently.
Effectively not only costing significantly more resources, but also
taking even longer in wall clock time. May still look like it works
reasonably on a small dual core system, breaks down at 4-8 threads, and
only gets so much worse past that point. Up to the point where the wall
clock run time ends up not at the desired O(n/p), but is actually O(n)
on paper, and reliably ends up approaching O(n*p) in real life with
increasing thread counts as it ends up serializing on hardware bottlenecks.

It's still the only strategy eligible for `par_seq` execution
requirements, but it's only beneficial under strict constraints
regarding the cost of the payload in comparison to the cost for
iteration. For `par_unseq`, it's hardly a good match.

The problem with `slice` is, that it is by design not a good choice for
any structure where precise(!) random access is costly. Doesn't matter
if `size` is O(1), if you still can't get the iterator for a specified
index in any reasonable time. So if we can't get the iterator we would
ideally want to have, we are still in need for *some* good iterator at
anything resembling a middle point reasonably well. And that strongly
depends on implementation details of the corresponding data structure,
what it can provide us with.

Just about anything is better than the 1:n-1 step a simple,
deterministic iterator increment could give, but if it ends up costing
O(n) (edge cases on degenerated structures aside) it was way too
expensive. We don't need to know how many elements end up being before
or after, even if we knew the size of the original range, as long as we
have a ballpark number telling us whether another division may be advisable.

Received on 2020-09-06 15:37:36