Hello,
This email is broken into three parts for clarity.
- My thoughts on the root problems of memory allocation.
- A demo of a new (?) allocator algorithm.
- A proposal and request for feedback.
[=My Perspective=]
I see “being unable to communicate intent” as the main problem with
high-performance memory allocation. When I need space on the heap,
it is usually in one of two patterns. Sometimes I need a quick,
temporary buffer in the middle of a calculation that cannot exist on
the stack for some reason. This is allocated in a LIFO pattern (as
the stack frames that require them push and pop). Other times I need
to store persistent state, such as a simulation or an in-memory
database. Other people may have different patterns because they have
different needs.
The key point is how different these patterns are.
There are workarounds, such as small string optimization (SSO) and
std::pmr::monotonic_buffer_resource. SSO is an automatic win, which is
nice, but the things that require the programmer to do something
specific tend to be algorithm-focused (rather than intent-focused).
Adding third-party dependencies for low-level memory access can be
unsettling, too.
[=Demo=]
So immediately after suggesting that we do not focus on algorithms,
here is an algorithm.
In the case of LIFO allocations that must be on the heap, a large
block of memory can be reserved and sliced up like a stack.
Allocations grow the stack, and deallocations shrink the stack. If
the current block (page) cannot fit the next allocation, then a new
block can be allocated (or the request can be forwarded to a
different allocator). This tends
to be fast, but, if a container resizes, then it will create gaps
that waste memory until the stack is able to shrink. Live
allocations bury the dead allocations behind them (until they
deallocate too).
My algorithm uses two stacks: one at the bottom of the page
allocating upward, and one at the top of the page allocating
downward. Both stacks grow toward the center, so they make full use
of the buffer, and allocations alternate between the two stacks.
During a resize, if the new buffer happens to land on the other
stack, and the first stack has nothing above the old buffer, then
the old buffer becomes free as soon as the container moves its
contents out. This is likely to help, for example, a single
container that resizes many times in a row.
It has the following properties:
- The stacks can expand indefinitely by adding extra pages to
the vector.
- It can be as tightly packed as the configurable alignment
allows. (Demo is set to 64 bytes.)
- Some allocations can cleanly deallocate early.
- Each system is per-thread (thread_local) so that thread pool
tasks clean up on completion.
- Any types can be stored together in the page.
Godbolt: https://godbolt.org/z/znq1j3
GitHub: https://github.com/ScottMichaud/MiscCpp_DoubleStackAllocator
One benchmark used a Mersenne Twister to push random numbers onto a
vector. The elements of that vector were then pushed onto two other
vectors: one for even numbers and one for odd numbers. Each
benchmark was run on a 16-core, 32-thread AMD processor running
Windows 10.
- GCC 10.2 on MSYS2 took ~30% more time than std::allocator (one
thread, release build).
- std::allocator had better start-to-finish performance by
~1.25x.
- MSVC 2019 took ~40% less time than std::allocator (one thread,
release build).
- DoubleStackAllocator had better start-to-finish performance
by ~1.65x.
- MSVC 2019 took ~60% less time than std::allocator when running
15 threads (release build).
- DoubleStackAllocator had better start-to-finish performance
by ~2.5x.
Imgur of profiler results: https://imgur.com/a/E5h3OdH
The actual proposal has nothing to do with any specific algorithm.
This just shows that solutions can optimize better when they know
more about how the allocation will be used (even though my allocator
was slower in one case). The platform has no idea what I am doing if
I just use
std::allocator.
[=Request for Feedback=]
As mentioned, the root problem seems to be a lack of communication
between the programmer and the implementation. What I would like to
see is a series of allocators named for specific use cases. Do not
burden implementations with pushing a specific algorithm; rather,
allow them to decide how to handle each case.
Going back to the LIFO, per-thread example:
- Some vendors might want to use the double-stack allocator.
- Some vendors might want to use a single-stack allocator.
- Some vendors might want to bypass RAM altogether for the first
page(s) of each thread.
- Why not let vendors set aside some asymmetric memory per
hardware thread?
- Some vendors may just use std::allocator (and that’s fine).
- Embedded device vendors might have a different idea
altogether.
The proposal is not about algorithms. The proposal is about defining
a set of common usage patterns, and naming allocators
according to those patterns. I believe that targeting goals
(like a hypothetical std::temporary_allocator, etc.) matches what
the application and platform want. This is especially true on the
education front. “Use the temporary allocator when you want to
allocate a temporary” sounds much easier to learn and teach than
“Use a single-stack allocator when you want to allocate a
temporary”.
It also allows vendors to change their implementation over time.
- What usage patterns would the STL want allocators for?
- Short-term LIFO?
- Some duration of FIFO?
- When lots of allocations are all the same size and
alignment?
- Alignments in general?
- Debug allocators to help tooling?
- Automatic zeroing on deallocation for security reasons?
- What would constitute a good name for a goal-based allocator?
- Do implementors and platform stakeholders have any thoughts?
- Who would dictate things like buffer size and alignment? The
programmer, the platform, or the STL?
- Would programmers even be interested in using
standards-defined allocators?
Thanks for your time and your consideration!