ISOCPP std-proposals List: Re: [std-proposals] New vector range constructor to improve performance with forward ranges

From: Arthur O'Dwyer <arthur.j.odwyer_at_[hidden]>
Date: Sat, 12 Nov 2022 10:41:44 -0500

On Sat, Nov 12, 2022 at 7:01 AM Lénárd Szolnoki wrote:

> On Fri, 11 Nov 2022 18:51:03 -0500 Jason McKesson wrote:
> > On Fri, Nov 11, 2022 at 5:12 PM Lénárd Szolnoki wrote:
> > >
> > > I suppose this applies to unsized ranges. But range does not have
> > > to be random access to have std::ranges::size, which is O(1) and
> > > meant to be cheap.
> > >
> > > For sized ranges it absolutely makes sense to do the preallocation,
> > > and it doesn't require traversing the range twice.
>

+1.

> > > I would say that for unsized ranges it makes more sense to not do
> > > this, and do the usual multiple reallocations.
> >
> > I disagree. For most ranges that don't have a size, traversal is
> > inexpensive next to the cost of memory allocations (and the requisite
> > copying/moving of data). Even complex views that need to invoke some
> > functor (like `filter_view` and the like) will generally be cheaper
> > than doing potentially dozens of memory allocations. Indeed, it's
> > pretty rare to find a forward range for whom iteration is so slow that
> > multiple memory allocation is preferred.
>

Strong disagree — notice that in order to do "dozens of memory
allocations," you'd need to be allocating 2^dozens elements — 2^20 is
already a million elements. (Admittedly on MSVC they use a different growth
factor: 1.5^20 is only like 3000.)
Also, Ranges iterators can be arbitrarily slow to traverse: all you have to
do is stick a `views::transform` or `views::filter` in there, and boom,
it's super slow.
So I think Lénárd's call is absolutely right: `vector` should certainly
trust a `sized_range` to report its size accurately, but if it's *not* a
sized range, `vector` shouldn't do any pre-traversal. Pre-traversal can
even be *dangerous*, for an input range (such as an `istream_iterator`);
and I wouldn't trust it to be non-dangerous in practice even for a range
that self-reported as `forward_range`.

C++20 provides a completely standard way for a range to self-report its
size: it's `sized_range` and `std::ranges::size`. We don't need `vector`
second-guessing that approach.

(*However*, as seen below, it seems that all major library vendors disagree
with my take: they *will* use `distance` and preallocate.)

> I'm quite surprised on this benchmark's results, but here you go:
> https://quick-bench.com/q/usHB_TCGMgB8b7lnjettooYGVnc

(Executive summary: InputIterator is faster than ForwardIterator, on GCC
12.2 + libstdc++ at -O2.)

I'm surprised by that benchmark too!
On Clang + libc++, I get much more sensible numbers, with Forward better
than Input at all optimization levels greater than -O0:

$ clang++ x.cpp -std=c++20 -I... -L... -lbenchmark -lbenchmark_main -g

$ ./a.out

[...]

------------------------------------------------------------

Benchmark Time CPU Iterations

------------------------------------------------------------

VectorFromForward 113493 ns 113403 ns 6319

VectorFromInput 110535 ns 110468 ns 6243

$ clang++ x.cpp -std=c++20 -I... -L... -lbenchmark -lbenchmark_main -g -O1

$ ./a.out

[...]

------------------------------------------------------------

Benchmark Time CPU Iterations

------------------------------------------------------------

VectorFromForward 1656 ns 1655 ns 396987

VectorFromInput 3045 ns 3043 ns 226321

Even knocking your `arr_size` number all the way up to 1'024'000, I still
get these numbers at -O2:

------------------------------------------------------------

Benchmark Time CPU Iterations

------------------------------------------------------------

VectorFromForward 1363242 ns 1361088 ns 510

VectorFromInput 2429556 ns 2427262 ns 298

As far as I can tell, libc++* does do the optimization we're discussing!*
It sees that the iterator is a Cpp17ForwardIterator and uses
`std::distance` (not `std::ranges::distance`, because libc++ hasn't yet
implemented C++20 `vector`) to retrieve the number of elements and
pre-allocate that much. So the only difference here *seems* to be that
libc++ and/or Clang can do `std::distance` of these filter_view::iterators
much faster than GCC+libstdc++, for some reason.
I have no idea why that should be, and would be interested to learn, if
anyone digs into it.

–Arthur

Received on 2022-11-12 15:41:58