Balanced division of iterator range without LegacyRandomAccessIterator trait

From: Andreas Ringlstetter <andreas.ringlstetter_at_[hidden]>
Date: Sun, 6 Sep 2020 08:55:29 +0200
Follow-up to
https://stackoverflow.com/questions/62306580/balanced-partition-of-iterator-range-without-legacyrandomaccessiterator

Let's start with a simple example:

    std::vector<int> foo;
    ... // fill it
    auto begin = foo.begin();
    auto end = foo.end();
    auto middle = begin + std::distance(begin, end) / 2;

If the iterators have the `LegacyRandomAccessIterator` trait, an ideal
bi-partition of the iterator range is embarrassingly easy, and so, by
extension, is any divide-and-conquer type algorithm. This pattern is
commonly found in the various STL implementations too.

If they provide only the `LegacyForwardIterator` trait, the only thing
we can do in constant time is:

    std::set<int> foo;
    ... // fill it
    auto begin = foo.begin();
    auto end = foo.end();
    auto middle = (begin != end) ? std::next(begin) : begin;

No chance of processing the iterator range in an even remotely
balanced way without a linear scan.

It's easy to understand that for most containers not providing
`LegacyRandomAccessIterator`, an ideal partition scheme can't come for
free. However, most containers could still provide a partitioning
scheme better than `1:n-1` on average at constant cost.

Any container based on a self-balancing tree would trivially be able
to guarantee a worst-case ratio based on intrinsic implementation
details. Any hash table - be it bucket-based or multi-round-based -
still has a high chance of providing a good partition, simply by
slicing the hash table itself in half.
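
As a rough illustration, the standard unordered containers already
expose a per-bucket view that hints at such a constant-time split; the
sketch below cuts the bucket array in half (note that local iterators
cannot be converted back to ordinary iterators, so today this remains
an internal capability of the container):

    #include <cstddef>
    #include <unordered_set>

    std::unordered_set<int> foo;
    // ... fill it
    // Split the bucket array in half; each half is one subrange.
    const std::size_t half = foo.bucket_count() / 2;
    for (std::size_t b = 0; b < half; ++b)
        for (auto it = foo.begin(b); it != foo.end(b); ++it)
            ; // process an element of the first half
    // Buckets [half, bucket_count()) form the second half analogously.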

Only unbalanced trees and plain linked lists are unsuitable by design.

This deficit also shows in STL-internal algorithms that depend on a
somewhat balanced division, e.g.
`std::for_each(std::execution::par_unseq, ...)`. Even though most STL
containers in virtually all implementations have implementation
details which allow constant-time partitioning into remotely balanced
chunks, I have yet to see any STL implementation handle anything
except `LegacyRandomAccessIterator` better than pure single-element
round-robin. This commonly results in false sharing of cache lines,
and synchronization overhead far beyond any practical value.

This raises the question: how come the C++ standard library provides
no interface to request a better division scheme for any of the STL
containers?

---
For me, this issue actually popped up when trying to utilize the
parallel algorithms from C++17 for element-wise bulk processing in an
`std::set` container: it turned out to perform significantly worse for
anything not random-access, due to the overhead from the lack of any
divide-and-conquer-friendly iteration scheme.
Without a proper division strategy, there are only two possible
implementations: either each thread processes only every n-th element
it has visited with a thread-local iterator copy, or all threads share
the same iterator behind a synchronization primitive. The first option
burns time on iteration, the second suffers from contention. The first
has the same lower bound as single-threaded no-op iteration; the
second even scales negatively with additional threads due to
contention-related issues.
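A minimal sketch of the first, round-robin option, with hypothetical
names (note that every thread still walks the entire range):

    #include <thread>
    #include <vector>

    template<typename Iter, typename Fn>
    void round_robin_for_each(Iter begin, Iter end, Fn fn, unsigned nthreads)
    {
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([=] {
                unsigned i = 0;
                // Only process every nthreads-th element, offset by t.
                for (Iter it = begin; it != end; ++it, ++i)
                    if (i % nthreads == t)
                        fn(*it);
            });
        for (auto &th : pool)
            th.join();
    }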
---
This is not something STL implementations can tackle individually
either. While it's certainly possible for individual implementations
to provide a better division strategy for their private container
implementations, this would still lack an interface for good
integration of STL algorithms with non-STL containers.
---
For reference, the following is about on par with what the existing
STL implementations do:

    #include <iterator>
    #include <type_traits>
    #include <utility>

    // Generic implementation
    template<
        typename Iter,
        std::enable_if_t<
            std::is_convertible_v<
                typename std::iterator_traits<Iter>::iterator_category,
                std::forward_iterator_tag>,
            int> = 0
    >
    std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
    divide(const Iter &begin, const Iter &end)
    {
        if constexpr (std::is_convertible_v<
                          typename std::iterator_traits<Iter>::iterator_category,
                          std::random_access_iterator_tag>)
        {
            // Ideal bisection
            const Iter middle = begin + std::distance(begin, end) / 2;
            return { {begin, middle}, {middle, end} };
        }
        else
        {
            // Not ideal, but constant time: a 1:n-1 split
            auto middle = begin;
            if (middle != end)
            {
                ++middle;
            }
            return { {begin, middle}, {middle, end} };
        }
    }
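For example, applied to a node-based container, the constant-time
fallback degenerates exactly as described above:

    #include <set>

    std::set<int> foo = {1, 2, 3, 4};
    auto [left, right] = divide(foo.begin(), foo.end());
    // left covers only the first element, right the remaining n-1;
    // recursing on right therefore degenerates into a linear chain.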
It already goes beyond the existing STL implementations in the sense
that it allows choosing not only a common middle point but, where
better suited, two arbitrary sequences of incompatible iterator types,
utilizing e.g. the ranges library or any other arbitrary
implementation details.
The motivation for that is to cater to use cases where e.g.
interleaved traversal better suits the requirements on data locality.
If that is not deemed worth supporting, then returning only the
approximately ideal middle point would suffice.
Another point to explore is whether an additional "strategy" parameter
shall be added. Possible options for this parameter would be "divide",
requesting an ideal bisection; "chunk", requesting "reasonable" work
packages to be sliced from the start of the range (e.g. to be called
in a round-robin fashion by multiple worker threads); or
"interleaved", to request a division pattern best suited for
"par_unseq" use cases. If this option is discarded, then the only
recommendable strategy is bisection. If the requested strategy is not
applicable to the container, then "divide" is the only reasonable
fallback choice.
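A hypothetical shape of that parameter (all names are illustrative,
not part of any existing API):

    // Possible strategies, as discussed above.
    enum class divide_strategy {
        divide,      // ideal bisection
        chunk,       // slice a "reasonable" work package off the front
        interleaved  // division pattern suited for vectorized use cases
    };

    template<typename Iter>
    std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
    divide(const Iter &begin, const Iter &end,
           divide_strategy strategy = divide_strategy::divide);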
---
An attempt to flesh out the actual requirements:
Hard requirements for "{begin1, end1}, {begin2, end2} = divide(begin, end)":
- if begin != end, then begin1 != end1
- if begin == end, then begin1 == end1 and begin2 == end2
- each element q in the original range is in exactly one of the subranges
- for each pair of elements p and q, if index(p) < index(q) in the
original range and p and q are within the same subrange, then index(p)
< index(q) still holds true
- yield the division point in amortized linear time over a full
recursive traversal
Soft requirements:
- distance(begin1, end1) and distance(begin2, end2) should come as
close to a 50:50 split as achievable within the time complexity
requirement
- the sub-ranges shall avoid data sharing with each other as much as
possible
- data locality shall be pursued within each sub-range
Recommended implementation details:
- For trees, each sub-range shall represent a sub-tree with a height
lower than the former lowest common ancestor of the input range
- For bucket-based containers, each sub-range boundary shall always be
located at the boundary of a bucket, except when the range was already
intersecting a bucket
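A minimal sketch of how a consumer could rely on these guarantees,
reusing the divide() fallback from above (the depth cut-off is an
arbitrary illustration):

    template<typename Iter, typename Fn>
    void visit(Iter begin, Iter end, Fn fn, unsigned depth)
    {
        if (begin == end)
            return; // both subranges would be empty anyway
        if (depth == 0)
        {
            for (; begin != end; ++begin)
                fn(*begin); // process this chunk sequentially
            return;
        }
        // Hard requirement: the first subrange is never empty, so the
        // recursion makes progress even on a 1:n-1 fallback split.
        auto [first, second] = divide(begin, end);
        visit(first.first, first.second, fn, depth - 1);
        visit(second.first, second.second, fn, depth - 1);
    }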
---
There is yet another utility in this context which would aid working
with container types of unknown complexity, though it is not a strict
prerequisite for the above proposal:
a heuristic alternative to "std::distance". The exact form we already
have does not admit an efficient implementation for anything except
dense, compact allocations, where it reduces to pointer arithmetic.
Yet in many use cases an exact measurement isn't required at all; a
guaranteed upper bound is plenty to decide between different traversal
strategies. Most STL container implementations - once again with the
exception of pure linked lists - can provide good estimates in either
logarithmic or even constant time.
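Such a utility could take roughly the following shape (the name is
purely illustrative):

    #include <iterator>

    // Hypothetical companion to std::distance: returns a guaranteed
    // upper bound on the number of elements in [first, last), computed
    // without a linear scan wherever the container allows it.
    template<typename Iter>
    typename std::iterator_traits<Iter>::difference_type
    distance_upper_bound(Iter first, Iter last);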
E.g. for a binary tree, a good estimate can be computed in log(n),
based solely on the height of the smallest common subtree containing
two arbitrary nodes.
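Concretely: if the lowest common ancestor of the two nodes roots a
subtree of height h (counted in levels), that subtree - and hence the
distance between the nodes - holds at most 2^h - 1 elements:

    #include <cstddef>

    // Upper bound on the size of a binary subtree of height h,
    // and thus on the distance between any two nodes inside it.
    std::size_t subtree_size_upper_bound(std::size_t h)
    {
        return (std::size_t{1} << h) - 1; // 2^h - 1
    }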
For hash maps it's even easier, as the sparse occupancy of the backing
storage can be ignored, and the maximum distance reduces to pure
pointer arithmetic.
It's still up for discussion whether a container should make a "guess"
even if the upper bound can't be guaranteed, only estimated. Think
e.g. of skip-list structures, which can make a good guess based on the
number of highest-tier links between two nodes, but that estimate may
be fundamentally off in case the data structure has become unbalanced.
In conjunction with the above proposal, it would serve as the
foundation for the "par" / "par_unseq" family of algorithms to make an
educated decision whether to keep distributing work horizontally or to
begin processing a given range.

Received on 2020-09-06 01:59:11