Date: Thu, 25 Jan 2024 20:45:16 +0000
Sorry, Phil, I believe I have misled you into conflating the proposal and
the specific use case which I had when I came up with it. Let me unpack
those two:
1. One-sided binary search gives std::lower_bound() better worst-case
asymptotic algorithmic complexity for non-random-access iterators:
2*log(N)-1 comparisons and 1.5*N iterator increments versus the classic
algorithm's log(N) comparisons and 2*N iterator increments. (I've slipped
in a correction here, I was previously saying that one-sided binary search
gave us N increments, but that was wrong: it's at most N increments looking
for the correct range, and then at most N/2 running classic binary search
in that range). It also gives us a much better lower bound on complexity:
classic binary search runs the same number of steps no matter the input,
but one-sided binary search will stop after a single step if our needle is
right at the beginning of the range.
2. When used inside std::set_intersection(), one-sided binary search does
in the worst case 2*N-1 comparisons and 3*N iterator increments, versus the
2*N-1 comparisons and 2*N iterator increments that the classic
implementation does, so they are both linear, with better constants for the
classic implementation. However, the lower bound for the classic algorithm
is N comparisons and N iterator increments, versus log(N) comparisons with
one-sided binary search, and, in terms of iterator increments, log(N) for
random-access iterators, or 1.5*N for all others. And then the interesting
question is: what are the conditions to trigger the worst case, can we
expect good cases to be common? The answer is that the good cases are the
ones where we can benefit from binary search. If your input is such that we
keep alternating which range contains the next value in order, then all the
machinery for one-sided binary search is useless, because the best you can
do is scan both ranges sequentially. It's when we have longer ranges in one
input without intervening values from the other that we can realize the
gains. So it's not specifically about the sizes of the input ranges, but
about where sequence ordering flips from one input to the other. Obviously,
if one of the ranges has less items there is less opportunity for
switching, and that gives us an idea of the subsets of the domain of inputs
where the version using one-sided binary search will do badly: it's only
when input sizes are of the same order of magnitude, and there are few
ranges on one which don't contain values from the other. That's pretty
constrained. In all other cases, the version using one-sided binary search
will win. And it will win by a lot more when we have random-access
iterators: for a large portion of the input domain, we've made
set_intersection wholly sublinear! For std::vector! That's no hack, right?
I hope I haven't rambled too much in this long email, let me know what you
think.
I need a bit more time to share the first draft of the paper, I was playing
with my std::set_intersection() benchmarks to add results to it, but came
to the conclusion that benchmarks for std::lower_bound() on its own would
be more relevant. The reason is that the natural thing to do, with
std::set_intersection(), is to compare the performance of the version using
one-sided binary search against the classic implementation, which uses
linear search, and, although I get very positive results, that's not at all
an apples-to-apples comparison. We want to know the difference between
one-sided binary search and classic binary search, right? The
counter-argument is that it's an inconceivably bad idea to use classic
binary search in std::set_intersection() -- and that's pretty meaningful,
right? This is more than your average performance increase, it increases
the range of applications for std::lower_bound(). At this point I think
that perhaps showing benchmark results comparing to classic binary search
and then discussing what I've just described may be the better way to
present it, but I'm still looking into it.
On Tue, 23 Jan 2024 at 19:28, Phil Endecott via Std-Proposals <
std-proposals_at_[hidden]> wrote:
> Arthur O'Dwyer wrote:
> > On Mon, Jan 22, 2024 at 11:03 AM Phil Endecott via Std-Proposals <
> > std-proposals_at_[hidden]> wrote:
> >> Ugh, I feel a bit queasy with the idea of doing a binary search through
> a
> >> std::set without using its (private) left/right/parent pointers to do
> so.
> >> Slightly more queasy than finding the size of a std::list without using
> its
> >> size() member. Here's an idea, add a std::set::lower_bound overload that
> >> takes an iterator pair (or a single begin iterator); maybe call it a
> hint,
> >> like std::set::insert?
> >
> > Off the top of my head, I think that might not be efficiently
> > implementable. IIRC, the library's red-black trees don't have *parent*
> > pointers; they have *next/prev* pointers (for iterator traversal).
>
> I believe you have remembered that wrongly.
>
> Thinking more about this, I believe what's really needed is a
> std::set-specific
> set-intersection algorithm; either a member of std::set or a friend
> function,
> maybe even a friend overload of std::(ranges?)::set_intersection. Using the
> tree structure of the set is the right way to do this algorithmically.
> Everything else is a hack to work around the fact that the tree
> structure is
> private.
>
> Sketch of set intersection of two binary trees:
>
> void set_intersection(const tree& a, const tree& b, tree-or-other& out)
> {
> if (a.empty() || b.empty()) return;
>
> // if (a.back() < b.front() || b.back() < a.front()) return;
>
> if (a.root() < b.root()) {
> set_intersection(a, b.left_subtree(), out);
> set_intersection(a.right_subtree(), b.right_subtree(), out);
>
> } else if (a.root() == b.root()) {
> set_intersection(a.left_subtree(), b.left_subtree(), out);
> out.insert_at_end(a.root());
> set_intersection(a.right_subtree(), b.right_subtree(), out);
>
> } else { // a.root() > b.root()
> set_intersection(a.left_subtree(), b, out);
> set_intersection(a.right_subtree(), b.right_subtree(), out);
> }
> }
>
>
> Regards, Phil.
>
>
>
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
the specific use case which I had when I came up with it. Let me unpack
those two:
1. One-sided binary search gives std::lower_bound() better worst-case
asymptotic algorithmic complexity for non-random-access iterators:
2*log(N)-1 comparisons and 1.5*N iterator increments versus the classic
algorithm's log(N) comparisons and 2*N iterator increments. (I've slipped
in a correction here, I was previously saying that one-sided binary search
gave us N increments, but that was wrong: it's at most N increments looking
for the correct range, and then at most N/2 running classic binary search
in that range). It also gives us a much better lower bound on complexity:
classic binary search runs the same number of steps no matter the input,
but one-sided binary search will stop after a single step if our needle is
right at the beginning of the range.
2. When used inside std::set_intersection(), one-sided binary search does
in the worst case 2*N-1 comparisons and 3*N iterator increments, versus the
2*N-1 comparisons and 2*N iterator increments that the classic
implementation does, so they are both linear, with better constants for the
classic implementation. However, the lower bound for the classic algorithm
is N comparisons and N iterator increments, versus log(N) comparisons with
one-sided binary search, and, in terms of iterator increments, log(N) for
random-access iterators, or 1.5*N for all others. And then the interesting
question is: what are the conditions to trigger the worst case, can we
expect good cases to be common? The answer is that the good cases are the
ones where we can benefit from binary search. If your input is such that we
keep alternating which range contains the next value in order, then all the
machinery for one-sided binary search is useless, because the best you can
do is scan both ranges sequentially. It's when we have longer ranges in one
input without intervening values from the other that we can realize the
gains. So it's not specifically about the sizes of the input ranges, but
about where sequence ordering flips from one input to the other. Obviously,
if one of the ranges has less items there is less opportunity for
switching, and that gives us an idea of the subsets of the domain of inputs
where the version using one-sided binary search will do badly: it's only
when input sizes are of the same order of magnitude, and there are few
ranges on one which don't contain values from the other. That's pretty
constrained. In all other cases, the version using one-sided binary search
will win. And it will win by a lot more when we have random-access
iterators: for a large portion of the input domain, we've made
set_intersection wholly sublinear! For std::vector! That's no hack, right?
I hope I haven't rambled too much in this long email, let me know what you
think.
I need a bit more time to share the first draft of the paper, I was playing
with my std::set_intersection() benchmarks to add results to it, but came
to the conclusion that benchmarks for std::lower_bound() on its own would
be more relevant. The reason is that the natural thing to do, with
std::set_intersection(), is to compare the performance of the version using
one-sided binary search against the classic implementation, which uses
linear search, and, although I get very positive results, that's not at all
an apples-to-apples comparison. We want to know the difference between
one-sided binary search and classic binary search, right? The
counter-argument is that it's an inconceivably bad idea to use classic
binary search in std::set_intersection() -- and that's pretty meaningful,
right? This is more than your average performance increase, it increases
the range of applications for std::lower_bound(). At this point I think
that perhaps showing benchmark results comparing to classic binary search
and then discussing what I've just described may be the better way to
present it, but I'm still looking into it.
On Tue, 23 Jan 2024 at 19:28, Phil Endecott via Std-Proposals <
std-proposals_at_[hidden]> wrote:
> Arthur O'Dwyer wrote:
> > On Mon, Jan 22, 2024 at 11:03 AM Phil Endecott via Std-Proposals <
> > std-proposals_at_[hidden]> wrote:
> >> Ugh, I feel a bit queasy with the idea of doing a binary search through
> a
> >> std::set without using its (private) left/right/parent pointers to do
> so.
> >> Slightly more queasy than finding the size of a std::list without using
> its
> >> size() member. Here's an idea, add a std::set::lower_bound overload that
> >> takes an iterator pair (or a single begin iterator); maybe call it a
> hint,
> >> like std::set::insert?
> >
> > Off the top of my head, I think that might not be efficiently
> > implementable. IIRC, the library's red-black trees don't have *parent*
> > pointers; they have *next/prev* pointers (for iterator traversal).
>
> I believe you have remembered that wrongly.
>
> Thinking more about this, I believe what's really needed is a
> std::set-specific
> set-intersection algorithm; either a member of std::set or a friend
> function,
> maybe even a friend overload of std::(ranges?)::set_intersection. Using the
> tree structure of the set is the right way to do this algorithmically.
> Everything else is a hack to work around the fact that the tree
> structure is
> private.
>
> Sketch of set intersection of two binary trees:
>
> void set_intersection(const tree& a, const tree& b, tree-or-other& out)
> {
> if (a.empty() || b.empty()) return;
>
> // if (a.back() < b.front() || b.back() < a.front()) return;
>
> if (a.root() < b.root()) {
> set_intersection(a, b.left_subtree(), out);
> set_intersection(a.right_subtree(), b.right_subtree(), out);
>
> } else if (a.root() == b.root()) {
> set_intersection(a.left_subtree(), b.left_subtree(), out);
> out.insert_at_end(a.root());
> set_intersection(a.right_subtree(), b.right_subtree(), out);
>
> } else { // a.root() > b.root()
> set_intersection(a.left_subtree(), b, out);
> set_intersection(a.right_subtree(), b.right_subtree(), out);
> }
> }
>
>
> Regards, Phil.
>
>
>
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2024-01-25 20:45:31