std-proposals: Re: Cache Alloc - 1.3x to 13.4x faster

From: Phil Bouchard <boost_at_[hidden]>
Date: Tue, 27 Jul 2021 23:25:12 -0400

On 7/27/21 2:46 PM, Jason McKesson wrote:
> On Tue, Jul 27, 2021 at 12:26 PM Phil Bouchard <boost_at_[hidden]> wrote:
>>
>> On 7/27/21 11:53 AM, Jason McKesson via Std-Proposals wrote:
>>
>> On Tue, Jul 27, 2021 at 11:21 AM Phil Bouchard via Std-Proposals
>> <std-proposals_at_[hidden]> wrote:
>>
>> So did you guys have a chance to try it out? Any questions?
>>
>> My main question is this. If I care about performance, and I'm in some
>> hot loop code where I need a stack/queue, and that stack/queue needs
>> to be so large that a stack buffer or fixed-size globally allocated
>> buffer won't do (a case with so many qualifiers that makes this a
>> rarified circumstance)... why would I *ever* use `std::list` for this?
>> `std::list` is almost always the *wrong* type if performance matters
>> to you.
>>
>> If you care about performance is the main reason people use C++ in the first place so C++ should offer the more efficient algorithms available.
> That doesn't explain why you want to use `std::list`, which is almost
> never "the more efficient algorithms available".
>
>> Sorry if my questions sound stupid but I don't understand your dislike of std::list. Yes I can use a more efficient chained buffer but the complexity remain the same.
> Complexity != performance.

Sorry I was referring to the constant complexity (O(1)).

>
> This is not about "like" vs. "dislike". From an all-around performance
> perspective, `std::list` is just *not good*. Poor cache locality
> between elements (even with your pool allocator), oversized data makes
> storing small objects incredibly inefficient memory-wise (what's the
> point of having a better allocator if you're allocating 6x more memory
> than you need?), etc. The only justifiable reasons to use one are when
> you need to insert/remove arbitrarily *a lot*, and even then you'll
> often find a `vector` outperforming it in cases where `T` is trivial.
>
>> Consider this case. Each "page" allocation for a simple
>> `std::list<int>` will be *much* larger than it needs to be. That
>> substantially damages the cache locality of accessing the elements.
>>
>> By cache locality are you referring to the bus cache?
>>
>>
>> This is more important for a queue than a stack, but it still matters
>> when you're actually *doing something* in the hot loop.
>>
>> `std::list` is good for maybe one or two use cases, all of which are
>> primarily about insertion/removal from the *middle*, not from the
>> ends.
>>
>> A dedicated stack or queue object that knows what it's doing will beat
>> it every single time.
>>
>> Why is an allocator-based solution to this problem (which has only
>> nebulously been specified, BTW) a good idea? That's the question you
>> haven't answered as of yet.
>>
>> What the allocator does is to gather allocations of the same type together so heap objects are not scattered around the memory therefore fragmenting the memory.
>>
>> This generic allocator can be used for overlayed lists, stacks, queues, maps, multimaps, unordered map, etc of the same type and frequency usage.
>>
>> Yes it does add a header to each element allocated but sometimes speed is what is required to the expense of more memory chips.
>>
>> My main goal is to prove that this compile-time allocator is more efficient than these LD_PRELOAD-style allocators out there.
> There is nothing especially "compile-time" about your allocator. And I
> don't see why I would use it when
> `std::pmr::unsycnhronized_pool_resource` exists. And if the virtual
> overhead of the type really bothers you, you can easily wrap it in a
> proper allocator type that hard-codes the resource type, so that
> devirtualization will take place.
>
> Also, shouldn't the goal be to improve performance? Beating the
> generic memory allocator could be a means to achieve that, in certain
> cases, but wouldn't it make more sense to identify specific poorly
> performing cases and target them? Your proposal is a solution in
> search of a problem.

So upon further investigations, by comparing it with
boost::fast_pool_allocator, cache_alloc:

- wins the 1st race by far (straight allocations);

- equals the 2nd one (allocs + deallocs);

- fails the 3rd one because of the destructor.

https://github.com/philippeb8/cache_alloc/blob/main/README.md

I'm not sure yet what's wrong with the destructor yet but if I find
anything valuable down the road I'll let you know.

Regards,

-- 
*Phil Bouchard*
Founder & CTO
C.: (819) 328-4743
Fornux Logo <http://www.fornux.com>

Received on 2021-07-27 22:25:21