Hello,
This email is broken into three parts for clarity.
- My thoughts on the root problems of memory allocation.
- A demo of a new (?) allocator algorithm.
- A proposal and request for feedback.
[=My Perspective=]
I see “being unable to communicate intent” as the main problem with
high-performance memory allocation. When I need space on the heap,
it is usually in one of two patterns. Sometimes I need a quick,
temporary buffer in the middle of a calculation that cannot exist on
the stack for some reason. This is allocated in a LIFO pattern (as
the stack frames that require them push and pop). Other times I need
to store persistent state, such as a simulation or an in-memory
database. Other people may have different patterns because they have
different needs.
The key point is how different these patterns are.
There are workarounds, such as small string optimization (SSO) and
std::pmr::monotonic_buffer_resource. SSO is an automatic win, which is
nice, but the things that require the programmer to do something
specific tend to be algorithm-focused (rather than intent-focused).
Adding third-party dependencies for low-level memory access can be
unsettling, too.
[=Demo=]
So immediately after suggesting that we do not focus on algorithms,
here is an algorithm.
In the case of LIFO allocations that must be on the heap, a large
block of memory can be reserved and sliced up like a stack.
Allocations grow the stack, and deallocations shrink the stack. If
the current block (page) cannot fit the next allocation, then a new
block can be allocated (or the request can be forwarded to a
different allocator). This tends
to be fast, but, if a container resizes, then it will create gaps
that waste memory until the stack is able to shrink. Live
allocations bury the dead allocations behind them (until they
deallocate too).
My algorithm uses two stacks: one at the bottom of the page
allocating upward, and one at the top of the page allocating
downward. Both stacks grow toward the center, so they make full use
of the buffer, and allocations alternate between the two stacks.
During a resize, if the new buffer happens to land on the other
stack, and the first stack has nothing above the old buffer, then
the old buffer becomes free as soon as the container moves its
contents out. This is likely to help, for example, a single
container that resizes many times in a row.
It has the following properties:
- The stacks can expand indefinitely by adding extra pages to
the vector.
- It can be as tightly packed as the configurable alignment
allows. (Demo is set to 64 bytes.)
- Some allocations can cleanly deallocate early.
- Each system is per-thread (thread_local) so that thread pool
tasks clean up on completion.
- Any types can be stored together in the page.
Godbolt: https://godbolt.org/z/znq1j3
GitHub: https://github.com/ScottMichaud/MiscCpp_DoubleStackAllocator
One benchmark used a Mersenne Twister to push random numbers onto a
vector. The elements of that vector were then pushed onto two other
vectors: one for even numbers and one for odd numbers. Each
benchmark was run on a 16-core, 32-thread AMD processor running
Windows 10.
- GCC 10.2 on MSYS2 took ~30% more time than std::allocator (one
thread, release build).
- std::allocator had better start-to-finish performance by
~1.25x.
- MSVC 2019 took ~40% less time than std::allocator (one thread,
release build).
- DoubleStackAllocator had better start-to-finish performance
by ~1.65x.
- MSVC 2019 took ~60% less time than std::allocator when running
15 threads (release build).
- DoubleStackAllocator had better start-to-finish performance
by ~2.5x.
Imgur of profiler results: https://imgur.com/a/E5h3OdH
The actual proposal has nothing to do with any specific algorithm.
This just shows that solutions can optimize better when they know
more about how the allocation will be used (even though my allocator
was slower in one case). The platform has no idea what I am doing if
I just use
std::allocator.
[=Request for Feedback=]
As mentioned, the root problem seems to be a lack of communication
between the programmer and the implementation. What I would like to
see is a series of allocators named for specific use cases. Do not
burden implementations with pushing a specific algorithm; rather,
allow them to decide how to handle each case.
Going back to the LIFO, per-thread example:
- Some vendors might want to use the double-stack allocator.
- Some vendors might want to use a single-stack allocator.
- Some vendors might want to bypass RAM altogether for the first
page(s) of each thread.
- Why not let vendors set aside some asymmetric memory per
hardware thread?
- Some vendors may just use std::allocator (and that’s fine).
- Embedded device vendors might have a different idea
altogether.
The proposal is not about algorithms. The proposal is about defining
a set of common usage patterns, and naming allocators
according to those patterns. I believe that targeting goals
(like a hypothetical std::temporary_allocator, etc.) matches what
the application and platform want. This is especially true on the
education front. “Use the temporary allocator when you want to
allocate a temporary” sounds much easier to learn and teach than
“Use a single-stack allocator when you want to allocate a
temporary”.
It also allows vendors to change their implementation over time.
- What usage patterns would the STL want allocators for?
- Short-term LIFO?
- Some duration of FIFO?
- When lots of allocations are all the same size and
alignment?
- Alignments in general?
- Debug allocators to help tooling?
- Automatic zeroing on deallocation for security reasons?
- What would constitute a good name for a goal-based allocator?
- Do implementors and platform stakeholders have any thoughts?
- Who would dictate things like buffer size and alignment? The
programmer, the platform, or the STL?
- Would programmers even be interested in using
standards-defined allocators?
Thanks for your time and your consideration!