Date: Sun, 6 Sep 2020 17:46:12 +0200

Arthur O'Dwyer:

> [...]

>

> If they provide only the `LegacyForwardIterator` trait, only thing we

> can do in constant time is:

>

> std::set<int> foo;

> ... // fill it

> auto begin = foo.begin();

> auto end = foo.end();

> auto middle = (begin != end) ? ++(begin) : begin;

>

> No chance of processing the iterator range in an even remotely

> balanced way without a linear scan.

>

>

> FWIW, `std::for_each` is allowed to take O(n) time to do its job. If

> it spends 1/Kth of that time doing an O(n) traversal of the list to

> find its actual midpoint, then that's fine, AFAIK. There's nothing in

> the formal standard AFAIK that says it /*must*/ use a dumb O(1)

> algorithm to partition the work. If you're seeing that dumb algorithm

> being used in implementations today, I think that's likely for

> software-engineering reasons, not standard-conformance reasons.

There is no high level division done at all, so far (unless where

trivial). Even if a better approach was used, anything except for

non-STL containers would be locked out regardless, in lack of an interface.

Only approach I've seen yet is that each out of p threads performs a

full linear scan, and then skips each p-1 out of p elements visited.

That's good enough to meet the O(n) requirement, but at an observed

p-fold amplification of system load. Speaking in terms of Ranges, it's

the equivalent of `views::stride`.

For a parallel `std::for_each` with `par_unseq` strategy, the goal is to

get as close to O(n/p) as possible though, not O(n). Doing the

partitioning in O(n) in a single thread first, prior to dispatch, would

certainly improve on the total cost, but still only meets the required

O(n) and not the ideal bound.

> However, I also don't see any clean way to implement what you're

> asking for. Consider what the compiler sees for

>

> std::set<int> s = {1,2,3,4,5,6,7,8,9,10,11};

> auto first = s.begin(), last = s.end();

> auto it1 = std2::divide_in_half(first, last);

> assert(*it1 == 6);

> std::advance(first, 2); std::advance(last, -4);

> auto it2 = std2::divide_in_half(first, last);

> assert(*it2 == 5);

>

> The iterators `first` and `last` don't know anything about the

> container that they're part of. They don't have any way to get at the

> "root" node of that tree, except to traverse over to it. (It is true

> AFAIK that an iterator would be able to tell /*whether*/ it were

> pointing directly at the root node, but that doesn't help us here AFAIK.)

> And the type-system doesn't see any difference between the first call

> to `std2::divide_in_half` there and the second call. The first one

> we'd like to quickly return the root node; the second one we do /*not

> */want to return the root node, and must do the slow O(n) traversal.

It's not possible to assert a "specific" middle point. The only way to

stay fast, is to let that be influenced by implementation details, as

long as the constraints (mostly the "at least 1 element on either side

of the middle point, if possible at all") are not violated.

Just keep in the back of your head that everything is better than a

1:n-1 split, and aiming for a perfect 1:1 split only yields diminishing

returns in terms of resulting time complexity. E.g. improving from a 2:1

to a 1:1 split only reduces run time by 33% (and even that is assuming

no re-balancing by work-stealing), which is easily outweighed by any

constant cost factor.

For the second case, if `first` and `last` end up on opposing sides of

the root node, then the ideal middle point remains the lowest common

parent of `first` and `last`, even if that is the root node again.

For finding the suitable middle point in any tree where a node knows the

height difference to root, the lowest common parent is found in

O(log(n)), single traversal with guaranteed early bail out. If the

absolute height isn't known, then Floyd's algorithm (with implicit

wrap-around on root) remains a O(log(n)) algorithm, albeit at a worse

constant factor and worse average case for "near-siblings".

> NOTE: For the purposes of this thread, I'd like to completely handwave

> away the practical difference between "the root node of the red-black

> tree" and "the n/2th node in sorted order." I'd like to just assume

> that we can solve any potential off-by-1 or off-by-2 errors that might

> arise, and not go down that nerdsnipe rabbit hole at all.

We can't even solve off-by-1 or off-by-2 errors for most structures

without a full scan. Respectively the errors are likely much bigger than

that. For trees we can only do it at all if we would know the precise

fill pattern of the leaf layer. It becomes entirely unfeasible for

sparse data structures with unknown fill levels, such has hash maps. For

skip lists, we can likewise only estimate.

There is also the issue that some degenerated container instances (e.g.

a sparse, almost empty hash map with elements only close to the end of

the backing storage) can still be tricked into an O(n) cost for the

first division attempt. Even though if the same containers was close to

capacity, that would have succeeded in average constant time.

To sum it up, there is no good solution for a perfectly balanced

division, and we can not guarantee it, but we don't need it to be

perfect either as the average case can be sped up by magnitudes regardless.

Balanced trees: From Θ(n) to O(n/p + log(n*p))

Sparse structures: From Θ(n) to O(n) with average case approaching

Θ(n/p + log(p)), but n is capacity...

Random access structures: Unchanged Θ(n/p)

Expected speedups for the non-parallel use cases (such as lower bound

scans in sorted collections of any form) look quite similar.

>

> [...]

>

> For reference, the following is about on-par with what the existing

> STL implementations do:

>

> // Generic implementation

> template<typename Iter, ~~~>

> std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>

> divide(const Iter &begin, const Iter &end)

> {

> if constexpr (std::is_convertible_v<typename

> std::iterator_traits<Iter>::iterator_category,

> std::random_access_iterator_tag>)

> {

> // Ideal bisection

> const Iter middle = end - std::distance(begin, end) / 2;

> return { {begin, middle}, {begin, middle} };

>

>

> Here and below, that should be {{begin, middle}, {middle, end}}, right?

Correct, that was a typo.

>

> Finally, I think you have at least two different problems here:

> (1) [...] I think range-v3 would be capable of this task; something

> with std::ranges::size and then repeatedly views::slice. Someone

> better at Ranges than me might try it.

There is no `std::ranges::size` for range types which couldn't yield it

in constant time, and `views::slice` incurs a full iteration over the

skipped iterators if not random-access.

You are correct that this is the ideal solution if he input range is

random-access.

> Pedantry: No, std::distance uses iterator arithmetic, so it works

> efficiently on any random-access iterator type. std::distance of

> std::deque iterators is constant-time, for example, even though deque

> is not contiguous.

> C++20 adds "contiguous iterator" as a refinement of "random-access

> iterator."

Thank you, I didn't realize that there were actually random-access

containers which are not contiguous. I will try not to mix these terms

up further.

> [...]

>

> If they provide only the `LegacyForwardIterator` trait, only thing we

> can do in constant time is:

>

> std::set<int> foo;

> ... // fill it

> auto begin = foo.begin();

> auto end = foo.end();

> auto middle = (begin != end) ? ++(begin) : begin;

>

> No chance of processing the iterator range in an even remotely

> balanced way without a linear scan.

>

>

> FWIW, `std::for_each` is allowed to take O(n) time to do its job. If

> it spends 1/Kth of that time doing an O(n) traversal of the list to

> find its actual midpoint, then that's fine, AFAIK. There's nothing in

> the formal standard AFAIK that says it /*must*/ use a dumb O(1)

> algorithm to partition the work. If you're seeing that dumb algorithm

> being used in implementations today, I think that's likely for

> software-engineering reasons, not standard-conformance reasons.

There is no high level division done at all, so far (unless where

trivial). Even if a better approach was used, anything except for

non-STL containers would be locked out regardless, in lack of an interface.

Only approach I've seen yet is that each out of p threads performs a

full linear scan, and then skips each p-1 out of p elements visited.

That's good enough to meet the O(n) requirement, but at an observed

p-fold amplification of system load. Speaking in terms of Ranges, it's

the equivalent of `views::stride`.

For a parallel `std::for_each` with `par_unseq` strategy, the goal is to

get as close to O(n/p) as possible though, not O(n). Doing the

partitioning in O(n) in a single thread first, prior to dispatch, would

certainly improve on the total cost, but still only meets the required

O(n) and not the ideal bound.

> However, I also don't see any clean way to implement what you're

> asking for. Consider what the compiler sees for

>

> std::set<int> s = {1,2,3,4,5,6,7,8,9,10,11};

> auto first = s.begin(), last = s.end();

> auto it1 = std2::divide_in_half(first, last);

> assert(*it1 == 6);

> std::advance(first, 2); std::advance(last, -4);

> auto it2 = std2::divide_in_half(first, last);

> assert(*it2 == 5);

>

> The iterators `first` and `last` don't know anything about the

> container that they're part of. They don't have any way to get at the

> "root" node of that tree, except to traverse over to it. (It is true

> AFAIK that an iterator would be able to tell /*whether*/ it were

> pointing directly at the root node, but that doesn't help us here AFAIK.)

> And the type-system doesn't see any difference between the first call

> to `std2::divide_in_half` there and the second call. The first one

> we'd like to quickly return the root node; the second one we do /*not

> */want to return the root node, and must do the slow O(n) traversal.

It's not possible to assert a "specific" middle point. The only way to

stay fast, is to let that be influenced by implementation details, as

long as the constraints (mostly the "at least 1 element on either side

of the middle point, if possible at all") are not violated.

Just keep in the back of your head that everything is better than a

1:n-1 split, and aiming for a perfect 1:1 split only yields diminishing

returns in terms of resulting time complexity. E.g. improving from a 2:1

to a 1:1 split only reduces run time by 33% (and even that is assuming

no re-balancing by work-stealing), which is easily outweighed by any

constant cost factor.

For the second case, if `first` and `last` end up on opposing sides of

the root node, then the ideal middle point remains the lowest common

parent of `first` and `last`, even if that is the root node again.

For finding the suitable middle point in any tree where a node knows the

height difference to root, the lowest common parent is found in

O(log(n)), single traversal with guaranteed early bail out. If the

absolute height isn't known, then Floyd's algorithm (with implicit

wrap-around on root) remains a O(log(n)) algorithm, albeit at a worse

constant factor and worse average case for "near-siblings".

> NOTE: For the purposes of this thread, I'd like to completely handwave

> away the practical difference between "the root node of the red-black

> tree" and "the n/2th node in sorted order." I'd like to just assume

> that we can solve any potential off-by-1 or off-by-2 errors that might

> arise, and not go down that nerdsnipe rabbit hole at all.

We can't even solve off-by-1 or off-by-2 errors for most structures

without a full scan. Respectively the errors are likely much bigger than

that. For trees we can only do it at all if we would know the precise

fill pattern of the leaf layer. It becomes entirely unfeasible for

sparse data structures with unknown fill levels, such has hash maps. For

skip lists, we can likewise only estimate.

There is also the issue that some degenerated container instances (e.g.

a sparse, almost empty hash map with elements only close to the end of

the backing storage) can still be tricked into an O(n) cost for the

first division attempt. Even though if the same containers was close to

capacity, that would have succeeded in average constant time.

To sum it up, there is no good solution for a perfectly balanced

division, and we can not guarantee it, but we don't need it to be

perfect either as the average case can be sped up by magnitudes regardless.

Balanced trees: From Θ(n) to O(n/p + log(n*p))

Sparse structures: From Θ(n) to O(n) with average case approaching

Θ(n/p + log(p)), but n is capacity...

Random access structures: Unchanged Θ(n/p)

Expected speedups for the non-parallel use cases (such as lower bound

scans in sorted collections of any form) look quite similar.

>

> [...]

>

> For reference, the following is about on-par with what the existing

> STL implementations do:

>

> // Generic implementation

> template<typename Iter, ~~~>

> std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>

> divide(const Iter &begin, const Iter &end)

> {

> if constexpr (std::is_convertible_v<typename

> std::iterator_traits<Iter>::iterator_category,

> std::random_access_iterator_tag>)

> {

> // Ideal bisection

> const Iter middle = end - std::distance(begin, end) / 2;

> return { {begin, middle}, {begin, middle} };

>

>

> Here and below, that should be {{begin, middle}, {middle, end}}, right?

Correct, that was a typo.

>

> Finally, I think you have at least two different problems here:

> (1) [...] I think range-v3 would be capable of this task; something

> with std::ranges::size and then repeatedly views::slice. Someone

> better at Ranges than me might try it.

There is no `std::ranges::size` for range types which couldn't yield it

in constant time, and `views::slice` incurs a full iteration over the

skipped iterators if not random-access.

You are correct that this is the ideal solution if he input range is

random-access.

> Pedantry: No, std::distance uses iterator arithmetic, so it works

> efficiently on any random-access iterator type. std::distance of

> std::deque iterators is constant-time, for example, even though deque

> is not contiguous.

> C++20 adds "contiguous iterator" as a refinement of "random-access

> iterator."

Thank you, I didn't realize that there were actually random-access

containers which are not contiguous. I will try not to mix these terms

up further.

Received on 2020-09-06 10:49:45