Date: Mon, 22 Jan 2024 14:34:05 +0000
> If both exist they need different names. Classic binary search already
> has a name, so one-sided needs a different name.
I definitely agree with that; my thought was mainly that it's not obvious
to me that one-sided binary search is important enough to merit a place
in the standard library -- I _really_ like it, but there are many other
useful algorithms missing and I can't really say that one-sided binary
search should be prioritised over them. What I thought was obvious enough
to merit a proposal is the fact that the current complexity guarantee for
std::lower_bound() blocks us from improving it for non-random-access
iterators. Having said that, I would support having an explicit one-sided
binary search function in the library.
> But in the case of std::list since C++11 we have constant-time
> size(), which std::lower_bound() doesn't exploit.
Well, std::lower_bound() takes a pair of iterators, so we can't really rely
on size(). And taking a pair of iterators was very useful to me in the
context of the libc++ proposal that made me start this thread; I'll
describe it in more detail below.
> Doing any sort of binary search in an ordered std::list, or similar,
> is IMHO a somewhat exotic thing to want to do.
I don't disagree with you: std::list sounds pretty exotic for
std::lower_bound()! But I've used it to great effect to speed up the
processing of large numbers of std::set instances, doing something similar
to std::set_intersection() (which, if you look at the libc++ proposal I
linked at the beginning of this thread, you'll find is now the only part
that remains, since it doesn't violate any standard library complexity
guarantees). The classic way of merging ordered sequences is to simply
iterate over both linearly, moving forward whichever iterator compares
less so that we find all matches, achieving the minimal worst-case
complexity of 2*n comparisons. But if you think about it, that's really
wasteful: both sequences are sorted, so why not use binary search to
fast-forward the iterator which is behind, performing O(log(n))
comparisons in the best-case scenario? By narrowing down the iterator
range as we move forward, one-sided binary search lets us preserve the
worst-case complexity while improving the best case. And the best-case
scenario isn't silly either -- in my specific case, which comes from
fairly popular closed-source software, std::set instances are often
orders of magnitude larger than the intersections between them, and their
contents aren't interlaced in terms of less-than comparisons (interlaced
input, for this algorithm, is as bad as having identical sets as input).
In my experience that's not an uncommon real-world scenario: operating on
the intersection of std::set instances where the intersection is a
fraction of the size of the input containers.
You may be asking yourself "why not use std::set::lower_bound() instead?",
and the answer is that std::set::lower_bound() doesn't take two iterators
as its input: it starts at the root of the tree and works its way down,
so the worst-case complexity of using it to iterate over both trees
would be O(n*log(n)). The magic of a linear worst-case traversal can only
happen with the standalone std::lower_bound(), where we decrease n as we
move forward, while at the same time exploiting the fact that the number
of steps required by one-sided binary search depends on the distance to
the match rather than on the size of the range.
> I wouldn't object to that specific change, if you had benchmarks showing
> an improvement for some non-contrived use case.
That same pull request I linked at the beginning of this thread contains
benchmark code for std::set_intersection() using one-sided binary search.
I'm thinking I'll do what Barry suggested and write a short paper
describing the change and the benchmark results so that it's more
straightforward to consume. It's been so long since I last wrote a paper
that it will probably be good exercise.
On Sat, 20 Jan 2024 at 17:54, Phil Endecott <std_proposals_list_at_[hidden]>
wrote:
> >> Personally I'd prefer something like this to have a name that told me
> >> what algorithm it actually is, e.g. one_sided_lower_bound.
>
> Iúri Chaer wrote:
> > Wouldn't that go counter to the design philosophy of the STL? lower_bound
> > is not the name of an algorithm, but of the action -- the algorithm
> behind
> > it in all the mainline implementations is the classic binary search, but
> > that's not explicit in its name.
>
> I don't want to get into a deep discussion about naming, but my personal
> preference is generally more towards using the accepted names of
> algorithms or data structures that we might find in textbooks, hence
> my use of "personally" in the quote above. Others will differ.
>
> But in the case of your proposal, an important consideration is that
> both classic binary search and one-sided binary search are useful for
> random-access containers, in different situations. If both exist they
> need different names. Classic binary search already has a name, so
> one-sided needs a different name.
>
> Doing any sort of binary search in an ordered std::list, or similar,
> is IMHO a somewhat exotic thing to want to do. Some might be surprised
> to discover that std::lower_bound and especially std::binary_search work
> at all with non-random-access iterators. You could even argue that it's
> dangerous for it to support data structures where its performance is
> linear rather than logarithmic, i.e. it would be better for the user to
> get an error or warning telling them to change their list to a vector,
> or to just do a std::find linear search.
>
> The reduced number of iterator increments that you report (n rather
> than 2n) is due to the classic binary search algorithm needing to
> find the length of the range when it starts. But in the case of
> std::list since C++11 we have constant-time size(), which
> std::lower_bound() doesn't exploit. Does
> std::ranges::lower_bound<std::list> do so? Using this would give n
> iterator mutations and log(n) comparisons, I think. Are there
> particular containers / iterators that you are interested in where
> the size of the range is not available in constant time?
>
>
> > I'd
> > be happy if the wording for the lower_bound guarantees were loosened to
> > allow the use of algorithms with better overall complexity. Perhaps it
> > could say something like "at most log(last - first)+O(1) comparisons for
> > random-access iterators, or O(log(last - first)) comparisons for other
> > iterator types".
>
> I wouldn't object to that specific change, if you had benchmarks showing
> an improvement for some non-contrived use case.
>
>
> Regards, Phil.
Received on 2024-01-22 14:34:19