Date: Sun, 06 Sep 2020 12:20:11 -0500

On Sun, Sep 6, 2020, at 08:13, Arthur O'Dwyer via Std-Proposals wrote:

> On Sun, Sep 6, 2020 at 2:55 AM Andreas Ringlstetter via Std-Proposals <std-proposals_at_[hidden]> wrote:

>> Follow up on https://stackoverflow.com/questions/62306580/balanced-partition-of-iterator-range-without-legacyrandomaccessiterator

>

> Interesting problem and relevant motivation!

>

> On SO, darune writes:

> > All associate containers are already inherently sorted and already provides specialized accessors and api's.

>

> But this isn't really true for your use-case, is it? I mean, suppose I have a std::set. It is a balanced binary tree, but (unless I'm missing something) there is no specialized accessor that would give me an iterator to the "midpoint/root" element in anything less than O(n) time. And a very slow and pointer-chasey O(n) at that!

> So (unless I'm missing something) darune's dismissal is a bit premature.

>

> However, I also don't see any clean way to implement what you're asking for. Consider what the compiler sees for

>

> std::set<int> s = {1,2,3,4,5,6,7,8,9,10,11};

> auto first = s.begin(), last = s.end();

> auto it1 = std2::divide_in_half(first, last);

> assert(*it1 == 6);

> std::advance(first, 2); std::advance(last, -4);

> auto it2 = std2::divide_in_half(first, last);

> assert(*it2 == 5);

As I read the original motivation, I don't think the above is what is really desired. Divide_in_half should divide in half (with off by one errors in cases that should be careful considered) and so should have the property above no matter how much time it take.

What I think is desired (but not as written) is find a good middle. Maybe it is the root node, but maybe it is some node near the middle that also happens to sit on a cpu cache line division such that you can run a parallel iteration algorithm on the two halves in roughly O(n/2) time without cache line problems . If the data structure is large enough this can save real clock time even though typically in analysis we ignor constant factors.

Thus what seems to be desired is a find_a_good_midpoint where the wording carefully allows the implementation to pick anything near the midpoint. In the case of vector it could be the edge of a cache line nearest the midpoint (though the standard wouldn't say that and so a conforming implementation could just be the true midpoint) . In the case of set it would be the lowest common root of the tree. (even if one side is vastly larger than the other) in this case the whole concept of off by one errors doesn't make sense since we are allowing off by much more than one.

Not that find_a_good_midpoint probably should send some sort of error if the range is small enough that you wouldn't use a parallel algorithm. Again implementation defined, or maybe a different function to ask if the range is worth splitting (I'd like this later to account for how busy the other cpus are but that seems to require solving the halting problem for what the other cpus are doing)

It goes without saying that when doing the above you are aware of iterator invalidations.

> The iterators `first` and `last` don't know anything about the container that they're part of. They don't have any way to get at the "root" node of that tree, except to traverse over to it. (It is true AFAIK that an iterator would be able to tell **whether** it were pointing directly at the root node, but that doesn't help us here AFAIK.)

> And the type-system doesn't see any difference between the first call to `std2::divide_in_half` there and the second call. The first one we'd like to quickly return the root node; the second one we do **not **want to return the root node, and must do the slow O(n) traversal.

>

> NOTE: For the purposes of this thread, I'd like to completely handwave away the practical difference between "the root node of the red-black tree" and "the n/2th node in sorted order." I'd like to just assume that we can solve any potential off-by-1 or off-by-2 errors that might arise, and not go down that nerdsnipe rabbit hole at all

I think that is fair. The motivation doesn't seem to be about mathematically mid points which such questions make sense.

> [...]

>> If they provide only the `LegacyForwardIterator` trait, only thing we

>> can do in constant time is:

>>

>> std::set<int> foo;

>> ... // fill it

>> auto begin = foo.begin();

>> auto end = foo.end();

>> auto middle = (begin != end) ? ++(begin) : begin;

>>

>> No chance of processing the iterator range in an even remotely

>> balanced way without a linear scan.

>

> FWIW, `std::for_each` is allowed to take O(n) time to do its job. If it spends 1/Kth of that time doing an O(n) traversal of the list to find its actual midpoint, then that's fine, AFAIK. There's nothing in the formal standard AFAIK that says it **must** use a dumb O(1) algorithm to partition the work. If you're seeing that dumb algorithm being used in implementations today, I think that's likely for software-engineering reasons, not standard-conformance reasons.

>

> [...]

>> For reference, the following is about on-par with what the existing

>> STL implementations do:

>>

>> // Generic implementation

>> template<typename Iter, ~~~>

>> std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>

>> divide(const Iter &begin, const Iter &end)

>> {

>> if constexpr (std::is_convertible_v<typename

>> std::iterator_traits<Iter>::iterator_category,

>> std::random_access_iterator_tag>)

>> {

>> // Ideal bisection

>> const Iter middle = end - std::distance(begin, end) / 2;

>> return { {begin, middle}, {begin, middle} };

>

> Here and below, that should be {{begin, middle}, {middle, end}}, right?

>

> Finally, I think you have at least two different problems here:

> (1) Bisect (trisect,...) a range into contiguous subranges of the same iterator-type and **exactly** the right sizes. I think range-v3 would be capable of this task; something with std::ranges::size and then repeatedly views::slice. Someone better at Ranges than me might try it.

> (2) Given a range, split it "quick and dirty" into subranges of possibly different iterator-type, and possibly not exactly the right sizes, but close enough to get speedup on parallel tasks.

>

> You mention that unordered_set would probably be very good at #2 — but horrible at #1.

> Another kind of data structure that might be good at #2 would be skiplists, whose iterators would know how to take "chunks" easily.

>

> [...]

>> A heuristic alternative to "std::distance". The exact form we already

>> have does not have an efficient implementation for anything except

>> dense, compact allocations where it can be reduced to pointer

>> arithmetic.

>

> Pedantry: No, std::distance uses iterator arithmetic, so it works efficiently on any random-access iterator type. std::distance of std::deque iterators is constant-time, for example, even though deque is not contiguous.

> C++20 adds "contiguous iterator" as a refinement of "random-access iterator."

>

> –Arthur

> --

> Std-Proposals mailing list

> Std-Proposals_at_[hidden]

> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

>

> On Sun, Sep 6, 2020 at 2:55 AM Andreas Ringlstetter via Std-Proposals <std-proposals_at_[hidden]> wrote:

>> Follow up on https://stackoverflow.com/questions/62306580/balanced-partition-of-iterator-range-without-legacyrandomaccessiterator

>

> Interesting problem and relevant motivation!

>

> On SO, darune writes:

> > All associate containers are already inherently sorted and already provides specialized accessors and api's.

>

> But this isn't really true for your use-case, is it? I mean, suppose I have a std::set. It is a balanced binary tree, but (unless I'm missing something) there is no specialized accessor that would give me an iterator to the "midpoint/root" element in anything less than O(n) time. And a very slow and pointer-chasey O(n) at that!

> So (unless I'm missing something) darune's dismissal is a bit premature.

>

> However, I also don't see any clean way to implement what you're asking for. Consider what the compiler sees for

>

> std::set<int> s = {1,2,3,4,5,6,7,8,9,10,11};

> auto first = s.begin(), last = s.end();

> auto it1 = std2::divide_in_half(first, last);

> assert(*it1 == 6);

> std::advance(first, 2); std::advance(last, -4);

> auto it2 = std2::divide_in_half(first, last);

> assert(*it2 == 5);

As I read the original motivation, I don't think the above is what is really desired. Divide_in_half should divide in half (with off by one errors in cases that should be careful considered) and so should have the property above no matter how much time it take.

What I think is desired (but not as written) is find a good middle. Maybe it is the root node, but maybe it is some node near the middle that also happens to sit on a cpu cache line division such that you can run a parallel iteration algorithm on the two halves in roughly O(n/2) time without cache line problems . If the data structure is large enough this can save real clock time even though typically in analysis we ignor constant factors.

Thus what seems to be desired is a find_a_good_midpoint where the wording carefully allows the implementation to pick anything near the midpoint. In the case of vector it could be the edge of a cache line nearest the midpoint (though the standard wouldn't say that and so a conforming implementation could just be the true midpoint) . In the case of set it would be the lowest common root of the tree. (even if one side is vastly larger than the other) in this case the whole concept of off by one errors doesn't make sense since we are allowing off by much more than one.

Not that find_a_good_midpoint probably should send some sort of error if the range is small enough that you wouldn't use a parallel algorithm. Again implementation defined, or maybe a different function to ask if the range is worth splitting (I'd like this later to account for how busy the other cpus are but that seems to require solving the halting problem for what the other cpus are doing)

It goes without saying that when doing the above you are aware of iterator invalidations.

> The iterators `first` and `last` don't know anything about the container that they're part of. They don't have any way to get at the "root" node of that tree, except to traverse over to it. (It is true AFAIK that an iterator would be able to tell **whether** it were pointing directly at the root node, but that doesn't help us here AFAIK.)

> And the type-system doesn't see any difference between the first call to `std2::divide_in_half` there and the second call. The first one we'd like to quickly return the root node; the second one we do **not **want to return the root node, and must do the slow O(n) traversal.

>

> NOTE: For the purposes of this thread, I'd like to completely handwave away the practical difference between "the root node of the red-black tree" and "the n/2th node in sorted order." I'd like to just assume that we can solve any potential off-by-1 or off-by-2 errors that might arise, and not go down that nerdsnipe rabbit hole at all

I think that is fair. The motivation doesn't seem to be about mathematically mid points which such questions make sense.

> [...]

>> If they provide only the `LegacyForwardIterator` trait, only thing we

>> can do in constant time is:

>>

>> std::set<int> foo;

>> ... // fill it

>> auto begin = foo.begin();

>> auto end = foo.end();

>> auto middle = (begin != end) ? ++(begin) : begin;

>>

>> No chance of processing the iterator range in an even remotely

>> balanced way without a linear scan.

>

> FWIW, `std::for_each` is allowed to take O(n) time to do its job. If it spends 1/Kth of that time doing an O(n) traversal of the list to find its actual midpoint, then that's fine, AFAIK. There's nothing in the formal standard AFAIK that says it **must** use a dumb O(1) algorithm to partition the work. If you're seeing that dumb algorithm being used in implementations today, I think that's likely for software-engineering reasons, not standard-conformance reasons.

>

> [...]

>> For reference, the following is about on-par with what the existing

>> STL implementations do:

>>

>> // Generic implementation

>> template<typename Iter, ~~~>

>> std::pair<std::pair<Iter, Iter>, std::pair<Iter, Iter>>

>> divide(const Iter &begin, const Iter &end)

>> {

>> if constexpr (std::is_convertible_v<typename

>> std::iterator_traits<Iter>::iterator_category,

>> std::random_access_iterator_tag>)

>> {

>> // Ideal bisection

>> const Iter middle = end - std::distance(begin, end) / 2;

>> return { {begin, middle}, {begin, middle} };

>

> Here and below, that should be {{begin, middle}, {middle, end}}, right?

>

> Finally, I think you have at least two different problems here:

> (1) Bisect (trisect,...) a range into contiguous subranges of the same iterator-type and **exactly** the right sizes. I think range-v3 would be capable of this task; something with std::ranges::size and then repeatedly views::slice. Someone better at Ranges than me might try it.

> (2) Given a range, split it "quick and dirty" into subranges of possibly different iterator-type, and possibly not exactly the right sizes, but close enough to get speedup on parallel tasks.

>

> You mention that unordered_set would probably be very good at #2 — but horrible at #1.

> Another kind of data structure that might be good at #2 would be skiplists, whose iterators would know how to take "chunks" easily.

>

> [...]

>> A heuristic alternative to "std::distance". The exact form we already

>> have does not have an efficient implementation for anything except

>> dense, compact allocations where it can be reduced to pointer

>> arithmetic.

>

> Pedantry: No, std::distance uses iterator arithmetic, so it works efficiently on any random-access iterator type. std::distance of std::deque iterators is constant-time, for example, even though deque is not contiguous.

> C++20 adds "contiguous iterator" as a refinement of "random-access iterator."

>

> –Arthur

> --

> Std-Proposals mailing list

> Std-Proposals_at_[hidden]

> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

>

Received on 2020-09-06 12:24:26