Date: Sun, 6 Sep 2020 08:55:29 +0200
Follow-up on
https://stackoverflow.com/questions/62306580/balanced-partition-of-iterator-range-without-legacyrandomaccessiterator
Let's start with a simple example:
std::vector<int> foo;
... // fill it
auto begin = foo.begin();
auto end = foo.end();
auto middle = begin + std::distance(begin, end) / 2;
If the iterators satisfy the `LegacyRandomAccessIterator` requirements, ideal
bi-partition of the iterator range is embarrassingly easy, and so,
therefore, is any form of divide-and-conquer algorithm. This pattern
is commonly found in the various STL implementations too.
If they satisfy only the `LegacyForwardIterator` requirements, the only
thing we can do in constant time is:
std::set<int> foo;
... // fill it
auto begin = foo.begin();
auto end = foo.end();
auto middle = (begin != end) ? std::next(begin) : begin;
No chance of processing the iterator range in an even remotely
balanced way without a linear scan.
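For completeness, the linear-scan fallback alluded to above can be sketched as follows. `linear_middle` is a hypothetical helper name; `std::distance` and `std::next` both walk a forward range element by element, so the cost is O(n):

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical helper: bisect a forward range by walking to the
// midpoint. For forward iterators std::distance and std::next are
// each linear, so this is exactly the linear scan the text refers to.
template <typename Iter>
Iter linear_middle(Iter begin, Iter end) {
    return std::next(begin, std::distance(begin, end) / 2);
}
```

For example, for `std::set<int>{1, 2, 3, 4, 5}` this yields the iterator pointing at 3, at the price of a full walk to find it.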
It's easy to understand that for most containers not providing
`LegacyRandomAccessIterator`, an ideal partition scheme can't come for
free. However, most containers could still provide a partitioning
scheme better than `1:n-1` on average, at constant cost.
Any container based on a self-balancing tree could trivially
guarantee a worst-case ratio based on intrinsic implementation
details. Any hash table - be it bucket based or multi-round based -
still has a high chance of providing a good partition, just by
slicing the hash table itself in half.
Only imbalanced trees or plain lists are by design unsuitable.
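As an illustrative sketch (not a proposed API), the bucket-slicing idea maps directly onto the bucket interface the standard unordered containers already expose; `bucket_count()` and `begin(n)`/`end(n)` are real members, while `divide_buckets` and `for_each_bucket_range` are invented helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>

// Hypothetical sketch: divide an unordered container's work by
// splitting its bucket array in half. The two halves can then be
// traversed independently via the container's local iterators.
template <typename Hashed>
std::pair<std::pair<std::size_t, std::size_t>,
          std::pair<std::size_t, std::size_t>>
divide_buckets(const Hashed &h) {
    const std::size_t half = h.bucket_count() / 2;
    return { {0, half}, {half, h.bucket_count()} };
}

// Visit one bucket range; every element lives in exactly one bucket,
// so the two halves together partition the container.
template <typename Hashed, typename F>
void for_each_bucket_range(const Hashed &h,
                           std::pair<std::size_t, std::size_t> r, F f) {
    for (std::size_t b = r.first; b != r.second; ++b)
        for (auto it = h.begin(b); it != h.end(b); ++it)
            f(*it);
}
```

The split is only as balanced as the hash distribution across buckets, which is the "high chance" qualifier above: good on average, with no hard worst-case guarantee.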
This deficit also shows in STL-internal algorithms that depend on a
somewhat balanced division, e.g.
`std::for_each(std::execution::par_unseq, ...)`. Even though most STL
containers in virtually all implementations have implementation
details which would allow constant-time partitioning into remotely
balanced chunks, I have yet to see any STL implementation handle
anything except `LegacyRandomAccessIterator` better than pure
single-element round-robin, commonly resulting in false sharing of
cache lines, and synchronization overhead far beyond any practical
value.
This raises the question: how come the C++ standard provides no
interface to request a better division scheme for any STL container?
---

For me, that issue actually popped up when trying to use the C++17 parallel algorithms for element-wise bulk processing of an `std::set` container, and it turned out to perform significantly worse for anything not random-access, due to the overhead from the lack of any divide-and-conquer friendly iteration scheme.

Without a proper division strategy, there are only two possible implementations: either each thread processes only every n-th element it visits, with a thread-local iterator copy, or all threads share the same iterator behind a synchronization primitive. The first option burns time on iteration, the second suffers from contention. The first has the same lower bound as single-threaded no-op iteration; the second even scales negatively with additional threads due to contention-related issues.

---

This is not something STL implementations can tackle individually either. While it's certainly possible for an individual implementation to provide a better division strategy for its private container implementations, that would still lack an interface for good integration of STL algorithms with non-STL containers.
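The first of the two strategies described above (thread-local iterator copies, each thread touching only every k-th element) can be sketched roughly like this; `round_robin_for_each` is a hypothetical name, and note that every worker still traverses the full range:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch of the round-robin strategy: num_threads workers, each with
// its own iterator copy, each applying f only to every
// num_threads-th element. Every worker still performs a full O(n)
// traversal, which is the lower bound complained about above.
template <typename Iter, typename F>
void round_robin_for_each(Iter begin, Iter end, F f, unsigned num_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([=] {
            unsigned i = 0;
            for (Iter it = begin; it != end; ++it, ++i) {
                if (i % num_threads == t)
                    f(*it);
            }
        });
    }
    for (auto &w : workers)
        w.join();
}
```

Adding threads here only parallelizes the calls to `f`, never the iteration itself, so for cheap per-element work the wall-clock time cannot drop below that of a single-threaded walk.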
---

For reference, the following is about on par with what the existing STL implementations do:

// Generic implementation
template<
    typename Iter,
    std::enable_if_t<
        std::is_convertible_v<
            typename std::iterator_traits<Iter>::iterator_category,
            std::forward_iterator_tag>,
        int> = 0
>
std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
divide(const Iter &begin, const Iter &end)
{
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        // Ideal bisection
        const Iter middle = begin + std::distance(begin, end) / 2;
        return { {begin, middle}, {middle, end} };
    } else {
        // Not ideal, but constant time
        auto middle = begin;
        if (middle != end) {
            ++middle;
        }
        return { {begin, middle}, {middle, end} };
    }
}

It already extends on existing STL implementations in the sense that it allows choosing not only a common middle point, but, if better suited, two arbitrary sequences of incompatible iterator types, utilizing e.g. the ranges library or any other arbitrary implementation details. The motivation for that is to cater for use cases where e.g. interleaved traversal better suits the requirements on data locality. If that is not deemed worth supporting, then returning only the approximately ideal middle point would suffice.

Another point to explore is whether an additional "strategy" parameter shall be added. Possible options for this parameter would be "divide", requesting an ideal bisection; "chunk", requesting "reasonable" work packages to be sliced from the start of the range (e.g. to be called in round-robin fashion by multiple worker threads); or "interleaved", requesting a division pattern best suited for "par_seq" use cases. If this option is discarded, then the only recommendable strategy is bisection. If the requested strategy is not applicable for the container, then "divide" is the only reasonable fallback choice.
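As a usage sketch, a recursive reduction can be built on top of such an interface. The block below carries its own self-contained copy of the `divide` shape (handing out the two disjoint subranges `{begin, middle}` and `{middle, end}`); `recursive_sum` is an invented example, not a proposed API:

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <type_traits>
#include <utility>
#include <vector>

// Self-contained copy of the divide() sketch from the text.
template <typename Iter>
std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
divide(Iter begin, Iter end) {
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        const Iter middle = begin + std::distance(begin, end) / 2;  // ideal
        return { {begin, middle}, {middle, end} };
    } else {
        auto middle = begin;                 // constant time, 1:n-1 split
        if (middle != end) ++middle;
        return { {begin, middle}, {middle, end} };
    }
}

// Hypothetical divide-and-conquer consumer. With random-access
// iterators the recursion depth is O(log n); with the 1:n-1 fallback
// it degenerates to O(n), which is the problem the post describes.
template <typename Iter>
long long recursive_sum(Iter begin, Iter end) {
    const auto n = std::distance(begin, end);
    if (n == 0) return 0;
    if (n == 1) return *begin;
    auto [left, right] = divide(begin, end);
    return recursive_sum(left.first, left.second) +
           recursive_sum(right.first, right.second);
}
```

In a parallel setting the two recursive calls would be handed to separate workers, which is exactly where the quality of the split decides the achievable speedup.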
---

An attempt to flesh out the actual requirements.

Hard requirements for `{begin1, end1}, {begin2, end2} = divide(begin, end)`:
- if begin != end, then begin1 != end1
- if begin == end, then begin1 == end1 and begin2 == end2
- each element q in the original range is in exactly one of the subranges
- for each pair of elements p and q, if index(p) < index(q) in the original range and p and q end up within the same subrange, then index(p) < index(q) still holds true
- the division point is yielded in amortized linear time over a full recursive traversal

Soft requirements:
- distance(begin1, end1) and distance(begin2, end2) should be as close to a 50:50 split as achievable within the time complexity requirement
- the subranges shall avoid data sharing with each other as much as possible
- data locality shall be pursued within each subrange

Recommended implementation details:
- For trees, each subrange shall represent a subtree with a height lower than that of the former lowest common ancestor of the input range
- For bucket-based containers, each subrange shall always be located at the boundary of a bucket, except when the input range was already intersecting a bucket

---

There is yet another utility in this context which would aid working with container types of unknown complexity, though it is not a strict prerequisite for the above proposal: a heuristic alternative to `std::distance`. The exact form we already have does not have an efficient implementation for anything except dense, compact allocations where it can be reduced to pointer arithmetic. Yet in many use cases an exact measurement isn't required at all; a guaranteed estimate of an upper bound is plenty to decide between different traversal strategies. Most STL container implementations, once again except for pure linked lists, can provide good estimates in either logarithmic or even constant time. E.g.
for a binary tree, a good estimate can be computed in O(log n) just based on the tree height of the common subtree containing two arbitrary nodes. For hash maps it's even easier, as the sparse properties of the backing storage can be ignored and the maximum distance is pure pointer arithmetic.

It's still up for discussion whether a container should make a "guess" even if the upper bound can't be guaranteed, only estimated. Think e.g. of skip list structures, which can make a good guess based on the number of highest-tier links between two nodes, but that estimate may be fundamentally off in case the data structure has become unbalanced.

In conjunction with the above proposal, this serves as the foundation for the "par_unseq" / "par_seq" family of algorithms to make an educated decision whether to keep distributing horizontally or to begin processing a given range.
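A minimal sketch of such a heuristic, assuming a customization point named `estimated_max_distance` (an invented name): the random-access branch is exact and O(1), while the generic fallback shown here is deliberately naive; a container-aware overload would instead return a structural bound, e.g. 2^height of the lowest common ancestor for a tree, without touching every element:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <type_traits>
#include <vector>

// Hypothetical heuristic counterpart to std::distance: returns a
// guaranteed upper bound on the number of elements in [begin, end).
template <typename Iter>
std::size_t estimated_max_distance(Iter begin, Iter end) {
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        return static_cast<std::size_t>(end - begin);  // exact, O(1)
    } else {
        // Naive fallback for illustration only: exact but O(n). A
        // tree- or bucket-backed container would overload this with a
        // structural O(log n) or O(1) bound instead.
        return static_cast<std::size_t>(std::distance(begin, end));
    }
}
```

A caller deciding between traversal strategies only needs the bound to be cheap and never too small, which is a much weaker contract than `std::distance` fulfills today.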
Received on 2020-09-06 01:59:11