Date: Sun, 6 Sep 2020 08:55:29 +0200
Follow-up on
https://stackoverflow.com/questions/62306580/balanced-partition-of-iterator-range-without-legacyrandomaccessiterator
Let's start with a simple example:
std::vector<int> foo;
... // fill it
auto begin = foo.begin();
auto end = foo.end();
auto middle = begin + std::distance(begin, end) / 2;
If the iterators satisfy the `LegacyRandomAccessIterator` requirements, ideal
bi-partition of the iterator range is embarrassingly easy, and so,
therefore, is any form of divide-and-conquer algorithm. This pattern
is commonly found in the various STL implementations too.
If they satisfy only the `LegacyForwardIterator` requirements, the only
thing we can do in constant time is:
std::set<int> foo;
... // fill it
auto begin = foo.begin();
auto end = foo.end();
auto middle = (begin != end) ? std::next(begin) : begin;
No chance of processing the iterator range in an even remotely
balanced way without a linear scan.
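For completeness, the linear-scan fallback alluded to above can be sketched as follows. `linear_middle` is a hypothetical helper name; `std::distance` and `std::next` both walk a forward range element by element, so the cost is O(n):

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical helper: bisect a forward range by walking to the
// midpoint. For forward iterators std::distance and std::next are
// each linear, so this is exactly the linear scan the text refers to.
template <typename Iter>
Iter linear_middle(Iter begin, Iter end) {
    return std::next(begin, std::distance(begin, end) / 2);
}
```

For example, for `std::set<int>{1, 2, 3, 4, 5}` this yields the iterator pointing at 3, at the price of a full walk to find it.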
It's easy to understand that for most containers not providing
`LegacyRandomAccessIterator`, an ideal partition scheme can't come for
free. However, most containers could still provide a partitioning
scheme better than `1:n-1` on average, at constant cost.
Any container based on a self-balancing tree could trivially
guarantee a worst-case ratio based on intrinsic implementation
details. Any hash table - be it bucket based or multi-round based -
still has a high chance of providing a good partition, just by
slicing the hash table itself in half.
Only imbalanced trees or plain lists are by design unsuitable.
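As an illustrative sketch (not a proposed API), the bucket-slicing idea maps directly onto the bucket interface the standard unordered containers already expose; `bucket_count()` and `begin(n)`/`end(n)` are real members, while `divide_buckets` and `for_each_bucket_range` are invented helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>

// Hypothetical sketch: divide an unordered container's work by
// splitting its bucket array in half. The two halves can then be
// traversed independently via the container's local iterators.
template <typename Hashed>
std::pair<std::pair<std::size_t, std::size_t>,
          std::pair<std::size_t, std::size_t>>
divide_buckets(const Hashed &h) {
    const std::size_t half = h.bucket_count() / 2;
    return { {0, half}, {half, h.bucket_count()} };
}

// Visit one bucket range; every element lives in exactly one bucket,
// so the two halves together partition the container.
template <typename Hashed, typename F>
void for_each_bucket_range(const Hashed &h,
                           std::pair<std::size_t, std::size_t> r, F f) {
    for (std::size_t b = r.first; b != r.second; ++b)
        for (auto it = h.begin(b); it != h.end(b); ++it)
            f(*it);
}
```

The split is only as balanced as the hash distribution across buckets, which is the "high chance" qualifier above: good on average, with no hard worst-case guarantee.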
This deficit also shows in STL-internal algorithms that depend on a
somewhat balanced division, e.g.
`std::for_each(std::execution::par_unseq, ...)`. Even though most STL
containers in virtually all implementations have implementation
details which would allow constant-time partitioning into remotely
balanced chunks, I have yet to see any STL implementation handle
anything except `LegacyRandomAccessIterator` better than pure
single-element round-robin, commonly resulting in false sharing of
cache lines, and synchronization overhead far beyond any practical
value.
This raises the question: how come the C++ standard provides no
interface to request a better division scheme for any STL container?
---

For me, that issue actually popped up when trying to use the C++17 parallel algorithms for element-wise bulk processing of an `std::set` container, and it turned out to perform significantly worse for anything not random-access, due to the overhead from the lack of any divide-and-conquer friendly iteration scheme.

Without a proper division strategy, there are only two possible implementations: either each thread processes only every n-th element it visits, with a thread-local iterator copy, or all threads share the same iterator behind a synchronization primitive. The first option burns time on iteration, the second suffers from contention. The first has the same lower bound as single-threaded no-op iteration; the second even scales negatively with additional threads due to contention-related issues.

---

This is not something STL implementations can tackle individually either. While it's certainly possible for an individual implementation to provide a better division strategy for its private container implementations, that would still lack an interface for good integration of STL algorithms with non-STL containers.
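The first of the two strategies described above (thread-local iterator copies, each thread touching only every k-th element) can be sketched roughly like this; `round_robin_for_each` is a hypothetical name, and note that every worker still traverses the full range:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch of the round-robin strategy: num_threads workers, each with
// its own iterator copy, each applying f only to every
// num_threads-th element. Every worker still performs a full O(n)
// traversal, which is the lower bound complained about above.
template <typename Iter, typename F>
void round_robin_for_each(Iter begin, Iter end, F f, unsigned num_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([=] {
            unsigned i = 0;
            for (Iter it = begin; it != end; ++it, ++i) {
                if (i % num_threads == t)
                    f(*it);
            }
        });
    }
    for (auto &w : workers)
        w.join();
}
```

Adding threads here only parallelizes the calls to `f`, never the iteration itself, so for cheap per-element work the wall-clock time cannot drop below that of a single-threaded walk.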
---

For reference, the following is about on par with what the existing STL implementations do:

// Generic implementation
template<
    typename Iter,
    std::enable_if_t<
        std::is_convertible_v<
            typename std::iterator_traits<Iter>::iterator_category,
            std::forward_iterator_tag>,
        int> = 0
>
std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
divide(const Iter &begin, const Iter &end)
{
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        // Ideal bisection
        const Iter middle = begin + std::distance(begin, end) / 2;
        return { {begin, middle}, {middle, end} };
    } else {
        // Not ideal, but constant time
        auto middle = begin;
        if (middle != end) {
            ++middle;
        }
        return { {begin, middle}, {middle, end} };
    }
}

It already extends on existing STL implementations in the sense that it allows choosing not only a common middle point, but, if better suited, two arbitrary sequences of incompatible iterator types, utilizing e.g. the ranges library or any other arbitrary implementation details. The motivation for that is to cater for use cases where e.g. interleaved traversal better suits the requirements on data locality. If that is not deemed worth supporting, then returning only the approximately ideal middle point would suffice.

Another point to explore is whether an additional "strategy" parameter shall be added. Possible options for this parameter would be "divide", requesting an ideal bisection; "chunk", requesting "reasonable" work packages to be sliced from the start of the range (e.g. to be called in round-robin fashion by multiple worker threads); or "interleaved", requesting a division pattern best suited for "par_seq" use cases. If this option is discarded, then the only recommendable strategy is bisection. If the requested strategy is not applicable for the container, then "divide" is the only reasonable fallback choice.
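As a usage sketch, a recursive reduction can be built on top of such an interface. The block below carries its own self-contained copy of the `divide` shape (handing out the two disjoint subranges `{begin, middle}` and `{middle, end}`); `recursive_sum` is an invented example, not a proposed API:

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <type_traits>
#include <utility>
#include <vector>

// Self-contained copy of the divide() sketch from the text.
template <typename Iter>
std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>
divide(Iter begin, Iter end) {
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        const Iter middle = begin + std::distance(begin, end) / 2;  // ideal
        return { {begin, middle}, {middle, end} };
    } else {
        auto middle = begin;                 // constant time, 1:n-1 split
        if (middle != end) ++middle;
        return { {begin, middle}, {middle, end} };
    }
}

// Hypothetical divide-and-conquer consumer. With random-access
// iterators the recursion depth is O(log n); with the 1:n-1 fallback
// it degenerates to O(n), which is the problem the post describes.
template <typename Iter>
long long recursive_sum(Iter begin, Iter end) {
    const auto n = std::distance(begin, end);
    if (n == 0) return 0;
    if (n == 1) return *begin;
    auto [left, right] = divide(begin, end);
    return recursive_sum(left.first, left.second) +
           recursive_sum(right.first, right.second);
}
```

In a parallel setting the two recursive calls would be handed to separate workers, which is exactly where the quality of the split decides the achievable speedup.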
---

An attempt to flesh out the actual requirements.

Hard requirements for `{begin1, end1}, {begin2, end2} = divide(begin, end)`:
- if begin != end, then begin1 != end1
- if begin == end, then begin1 == end1 and begin2 == end2
- each element q in the original range is in exactly one of the subranges
- for each pair of elements p and q, if index(p) < index(q) in the original range and p and q end up within the same subrange, then index(p) < index(q) still holds true
- the division point is yielded in amortized linear time over a full recursive traversal

Soft requirements:
- distance(begin1, end1) and distance(begin2, end2) should be as close to a 50:50 split as achievable within the time complexity requirement
- the subranges shall avoid data sharing with each other as much as possible
- data locality shall be pursued within each subrange

Recommended implementation details:
- For trees, each subrange shall represent a subtree with a height lower than that of the former lowest common ancestor of the input range
- For bucket-based containers, each subrange shall always be located at the boundary of a bucket, except when the input range was already intersecting a bucket

---

There is yet another utility in this context which would aid working with container types of unknown complexity, though it is not a strict prerequisite for the above proposal: a heuristic alternative to `std::distance`. The exact form we already have does not have an efficient implementation for anything except dense, compact allocations where it can be reduced to pointer arithmetic. Yet in many use cases an exact measurement isn't required at all; a guaranteed estimate of an upper bound is plenty to decide between different traversal strategies. Most STL container implementations, once again except for pure linked lists, can provide good estimates in either logarithmic or even constant time. E.g.
for a binary tree, a good estimate can be computed in O(log n) just based on the tree height of the common subtree containing two arbitrary nodes. For hash maps it's even easier, as the sparse properties of the backing storage can be ignored and the maximum distance is pure pointer arithmetic.

It's still up for discussion whether a container should make a "guess" even if the upper bound can't be guaranteed, only estimated. Think e.g. of skip list structures, which can make a good guess based on the number of highest-tier links between two nodes, but that estimate may be fundamentally off in case the data structure has become unbalanced.

In conjunction with the above proposal, this serves as the foundation for the "par_unseq" / "par_seq" family of algorithms to make an educated decision whether to keep distributing horizontally or to begin processing a given range.
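A minimal sketch of such a heuristic, assuming a customization point named `estimated_max_distance` (an invented name): the random-access branch is exact and O(1), while the generic fallback shown here is deliberately naive; a container-aware overload would instead return a structural bound, e.g. 2^height of the lowest common ancestor for a tree, without touching every element:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <type_traits>
#include <vector>

// Hypothetical heuristic counterpart to std::distance: returns a
// guaranteed upper bound on the number of elements in [begin, end).
template <typename Iter>
std::size_t estimated_max_distance(Iter begin, Iter end) {
    if constexpr (std::is_convertible_v<
                      typename std::iterator_traits<Iter>::iterator_category,
                      std::random_access_iterator_tag>) {
        return static_cast<std::size_t>(end - begin);  // exact, O(1)
    } else {
        // Naive fallback for illustration only: exact but O(n). A
        // tree- or bucket-backed container would overload this with a
        // structural O(log n) or O(1) bound instead.
        return static_cast<std::size_t>(std::distance(begin, end));
    }
}
```

A caller deciding between traversal strategies only needs the bound to be cheap and never too small, which is a much weaker contract than `std::distance` fulfills today.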
Received on 2020-09-06 01:59:11