Wow thanks for the code review!

The goal when I was designing it was per-thread temporaries in a thread-pool job system. I envisioned it as a sort of expansion pack for a thread's stack. The allocator isn't designed for elements that persist beyond a blocking function call, so I didn't consider moving them from thread to thread.

This tangentially hits the core point, though.

My proposal was suggesting that memory allocation should be goal-centric.

My thought is that this would be easy to teach and flexible to implement, especially if a vendor wants to try something way outside-the-box, like the "skipping main RAM and mapping directly to a cache" example.

If we say "have a stack-based allocator" then the implementation's hands are tied to an algorithm. If we say "have an allocator that's fast for locals" then they can take it and run as far as they want.

But again that's why I'm requesting feedback. I might be miles away from what typical stakeholders care about.

Thanks again!

On 11/5/2020 5:41 PM, Arthur O'Dwyer wrote:
On Thu, Nov 5, 2020 at 9:35 AM Scott Michaud via Std-Proposals <> wrote:

My algorithm uses two stacks, one at the bottom of the page allocating up, and one at the top of the page allocating down. [...]

My comment on the benchmark:
It looks to me as if you have another constraint on your allocation pattern. Memory allocated in thread A cannot be deallocated in thread B. I can't do, like,
    std::vector<int, DoubleStackAllocator<int>> v = {1,2,3};
    std::thread t = std::thread([v = std::move(v)]() {
        // do nothing; `v` is destroyed (and its memory deallocated) on thread t
    });
    t.join();
Is my understanding accurate?  If so, I think you should consider whether you're really doing an apples-to-apples comparison in your benchmark there.

But anyway, I think the benchmark is mostly a red herring. Everyone should agree that sometimes allocator A will outperform allocator B, and sometimes vice versa, depending on the workload. We don't really need benchmarks to prove that, IMHO.

I recommend that you take your DoubleStackAllocator<T> and re-code it as a class derived from C++17's std::pmr::memory_resource, so that this code works—

    DoubleStackMemoryResource mr;
    std::pmr::vector<int> v({1, 2, 3}, &mr);

Notice that now it becomes obvious that `mr` needs somewhere to live, and it needs some way to explain where it's getting its initial chunk of memory from.
In your current code, you kind of handwave that dependency by hiding it in a global variable (namely, that `thread_local std::vector<DoubleStack>` inside a static member function of DoubleStackAllocatorImpl). Pulling it out into the sunlight will improve the design.

"But std::pmr relies on virtual function calls, which are slow!" True. But I think the community knows how to deal with that, if one really needs to. I recall a conversation with Bryce Adelstein Lelbach and (I think) Michal Dominiak, several years ago, when the Thrust library was trying to do allocators for the first time. Basically, std::pmr::polymorphic_allocator<T> is a poor-man's-type-erased allocator that can point to any memory_resource; you just need to implement a monomorphic_allocator that can point to only one kind of memory_resource. Then the user gets to choose which one to use — polymorphic or monomorphic.

Something like this:
Unfortunately I do see a virtual call still happening in the right-hand pane, though; I'm not sure why Clang isn't devirtualizing it based on the knowledge that FinalOf<std::pmr::monotonic_buffer_resource> is a final class.

Note that my toy monomorphic_allocator<T,MR> is missing a lot of stuff that the real one would have to deal with, such as a converting constructor and POCMA/POCCA/POCS. I don't think that the diff in the godbolt above is attributable to its lack of any of that stuff, though.

Anyway, you should definitely look into C++17 std::pmr; it's doing basically the same kind of thing you're interested in.
I also highly recommend my own talk on the subject, "An Allocator is a Handle to a Heap," if you haven't already seen it: