This email is broken into three parts for clarity.
[=My Perspective=]

I see “being unable to communicate intent” as the main problem with high-performance memory allocation. When I need space on the heap, it is usually in one of two patterns. Sometimes I need a quick, temporary buffer in the middle of a calculation that cannot exist on the stack for some reason. This is allocated in a LIFO pattern (as the stack frames that require them push and pop). Other times I need to store persistent state, such as a simulation or an in-memory database. Other people may have different patterns because they have different needs.

The key point is how different these patterns are.

There are workarounds, such as small string optimization (SSO) and std::pmr::monotonic_buffer_resource. SSO is an automatic win, which is nice, but the workarounds that require the programmer to do something specific tend to be named after an algorithm rather than an intent. Adding third-party dependencies for low-level memory access can be unsettling, too.
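To make the pmr workaround concrete, here is a minimal sketch (the buffer size and element count are arbitrary choices of mine): a monotonic_buffer_resource hands out slices of a local buffer and falls back to its upstream resource (the heap, by default) once the buffer is exhausted.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Fill a pmr::vector that draws its storage from a stack-resident buffer.
// When the buffer runs out, monotonic_buffer_resource silently falls back
// to the upstream resource, so this never fails outright.
std::size_t fill_from_pool(std::size_t n) {
    std::byte buffer[1024];
    std::pmr::monotonic_buffer_resource pool{buffer, sizeof buffer};

    std::pmr::vector<int> scratch(&pool);
    for (std::size_t i = 0; i < n; ++i)
        scratch.push_back(static_cast<int>(i));
    return scratch.size();
}
```

Note that this is algorithm-focused in exactly the way described above: the programmer has to know what "monotonic" means and size the buffer, rather than stating the intent "this is a temporary".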


So immediately after suggesting that we do not focus on algorithms, here is an algorithm.

In the case of LIFO allocations that must be on the heap, a large block of memory can be reserved and sliced up like a stack: allocations grow the stack, and deallocations shrink it. If the block cannot store the next allocation, then a new block can be allocated (or the request can be handed to a different allocator). This tends to be fast, but if a container resizes, it creates gaps that waste memory until the stack is able to shrink back past them. Live allocations bury the dead allocations behind them (until they deallocate too).
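A minimal sketch of that single-stack scheme (the class name and fixed buffer are mine, and alignment and block chaining are omitted for brevity):

```cpp
#include <cstddef>

// Minimal LIFO arena: allocations bump a pointer upward; deallocation is
// only honored for the most recent live allocation (strict stack order).
class StackArena {
public:
    StackArena(std::byte* buf, std::size_t size)
        : base_(buf), top_(buf), end_(buf + size) {}

    void* allocate(std::size_t bytes) {
        if (top_ + bytes > end_) return nullptr;  // full: caller falls back
        void* p = top_;
        top_ += bytes;
        return p;
    }

    // Frees only if p is the topmost allocation; otherwise the space stays
    // "buried" until everything above it has been released too.
    void deallocate(void* p, std::size_t bytes) {
        if (static_cast<std::byte*>(p) + bytes == top_)
            top_ = static_cast<std::byte*>(p);
    }

    std::size_t used() const { return static_cast<std::size_t>(top_ - base_); }

private:
    std::byte* base_;
    std::byte* top_;
    std::byte* end_;
};

// Two allocations released in LIFO order return the arena to empty.
std::size_t stack_arena_demo() {
    std::byte buf[64];
    StackArena a{buf, sizeof buf};
    void* p1 = a.allocate(16);
    void* p2 = a.allocate(16);
    a.deallocate(p2, 16);  // topmost: honored immediately
    a.deallocate(p1, 16);  // now topmost: honored as well
    return a.used();
}
```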

My algorithm uses two stacks: one at the bottom of the page allocating up, and one at the top of the page allocating down. Both stacks grow toward the center, so they make full use of the buffer, and allocations alternate between the two stacks. During a resize, if the new buffer lands on the other stack and nothing sits above the old buffer on its own stack, then the old buffer becomes free as soon as the container has copied its contents out. This is likely to help, for example, a single container that resizes many times in a row.
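A stripped-down sketch of the alternating two-stack idea follows (the full implementation lives in the GitHub repo; alignment, deallocation, and the resize bookkeeping are all omitted here):

```cpp
#include <cstddef>

// Simplified two-stack arena: one stack grows up from the bottom of the
// buffer, the other grows down from the top, and successive allocations
// alternate between them so a resizing container's old and new buffers
// tend to land on opposite stacks.
class DoubleStackArena {
public:
    DoubleStackArena(std::byte* buf, std::size_t size)
        : low_(buf), high_(buf + size) {}

    void* allocate(std::size_t bytes) {
        if (low_ + bytes > high_) return nullptr;  // stacks met in the middle
        use_low_ = !use_low_;                      // alternate stacks
        if (use_low_) {
            void* p = low_;
            low_ += bytes;
            return p;
        }
        high_ -= bytes;
        return high_;
    }

private:
    std::byte* low_;
    std::byte* high_;
    bool use_low_ = false;
};

// First allocation lands at the bottom of the buffer, the second at the top.
bool double_stack_demo() {
    std::byte buf[64];
    DoubleStackArena a{buf, sizeof buf};
    auto* p1 = static_cast<std::byte*>(a.allocate(8));
    auto* p2 = static_cast<std::byte*>(a.allocate(8));
    return p1 == buf && p2 == buf + 56;
}
```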

A reference implementation is linked below:
Godbolt: https://godbolt.org/z/znq1j3
GitHub: https://github.com/ScottMichaud/MiscCpp_DoubleStackAllocator

One benchmark used a Mersenne Twister to push random numbers onto a vector; the elements of that vector were then pushed onto two other vectors, one for even numbers and one for odd. Each benchmark was run on a 16-core, 32-thread AMD processor running Windows 10.
Imgur of profiler results: https://imgur.com/a/E5h3OdH
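For reference, the workload can be sketched roughly as follows (the element count and seed are my assumptions, and this version uses the default allocator; it only shows the shape of the benchmark, not the measurement itself):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Fill a vector with Mersenne Twister output, then split its elements
// into even and odd vectors. All three vectors grow by repeated resizing,
// which is the allocation pattern the benchmark stresses.
std::pair<std::size_t, std::size_t> run_workload(std::size_t count) {
    std::mt19937 rng{12345u};  // fixed seed for repeatability
    std::vector<std::uint32_t> numbers;
    for (std::size_t i = 0; i < count; ++i)
        numbers.push_back(rng());

    std::vector<std::uint32_t> evens, odds;
    for (std::uint32_t n : numbers)
        (n % 2 == 0 ? evens : odds).push_back(n);
    return {evens.size(), odds.size()};
}
```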

The actual proposal has nothing to do with any specific algorithm. This just demonstrates that an allocator can optimize better when it knows more about how the allocation will be used (even though mine was slower in one case). The platform has no idea what I am doing if I just use std::allocator.

[=Request for Feedback=]

As mentioned, the root problem seems to be a lack of communication between the programmer and the implementation. What I would like to see is a set of standard allocators named after specific use cases. Do not burden implementations with a mandated algorithm; rather, allow them to decide how to handle each individual case.

Going back to the LIFO, per-thread example:
The proposal is not about algorithms. The proposal is about defining a set of common usage patterns, and naming allocators after those patterns. I believe that naming by goal (a hypothetical std::temporary_allocator, for instance) matches what both the application and the platform want to express. This is especially true on the education front. “Use the temporary allocator when you want to allocate a temporary” sounds much easier to learn and teach than “Use a single-stack allocator when you want to allocate a temporary”.
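To illustrate what usage might look like, here is a sketch. std::temporary_allocator does not exist; it is aliased to std::allocator below purely so the sketch compiles, where a vendor would substitute whatever intent-aware implementation fits their platform.

```cpp
#include <memory>
#include <vector>

// Hypothetical placeholder: the proposed name, backed here by the default
// allocator only so the example builds. The vendor owns the algorithm.
namespace proposed {
template <class T>
using temporary_allocator = std::allocator<T>;
}

int sum_scratch(int n) {
    // The allocator states intent ("short-lived temporary"), not mechanism.
    std::vector<int, proposed::temporary_allocator<int>> scratch;
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        scratch.push_back(i);
        total += i;
    }
    return total;
}
```

The point is that the declaration reads as a statement of intent, and the implementation is free to route it to a thread-local stack, a monotonic pool, or plain malloc.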

It also allows vendors to change their implementation over time.
Thanks for your time and your consideration!