Date: Thu, 5 Nov 2020 09:34:56 -0500
Hello,
This email is broken into three parts for clarity.
* My thoughts on the root problems of memory allocation.
* A demo of a new (?) allocator algorithm.
* A proposal and request for feedback.
*[=My Perspective=]*
I see “being unable to communicate intent” as the main problem with
high-performance memory allocation. When I need space on the heap, it is
usually in one of two patterns. Sometimes I need a quick, temporary
buffer in the middle of a calculation that cannot exist on the stack for
some reason. Such buffers are allocated in a LIFO pattern (as the stack
frames that require them push and pop). Other times I need to store
persistent
state, such as a simulation or an in-memory database. Other people may
have different patterns because they have different needs.
The key point is how different these patterns are.
There are workarounds, such as small string optimization and
std::pmr::monotonic_buffer_resource. SSO is an automatic win, which is nice,
but the things that require the programmer to do something specific tend
to be algorithm-focused (rather than intent-focused). Adding third-party
dependencies for low-level memory access can be unsettling, too.
*[=Demo=]*
So immediately after suggesting that we do not focus on algorithms, here
is an algorithm.
In the case of LIFO allocations that must be on the heap, a large block
of memory can be reserved and sliced up like a stack. Allocations grow
the stack, and deallocations shrink the stack. If the page cannot store
the next allocation, then a new block can be allocated (or it can be
given to a different allocator). This tends to be fast, but if a
container resizes, it leaves gaps that waste memory until the stack can
shrink past them. Live allocations bury the dead allocations behind
them (until they deallocate too).
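The single-stack scheme above can be sketched as follows. This is a minimal illustration, not the linked demo's code; the name StackPage and the fixed page size are my own, and alignment handling is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of a single-stack ("bump") page: allocations grow the
// stack, deallocations shrink it, and only the topmost live allocation
// can be reclaimed immediately.
class StackPage {
public:
    explicit StackPage(std::size_t bytes) : buffer_(bytes), top_(0) {}

    // Grow the stack; return nullptr when the page is exhausted, at
    // which point a real allocator would chain in a new page (or hand
    // the request to a different allocator).
    void* allocate(std::size_t bytes) {
        if (top_ + bytes > buffer_.size()) return nullptr;
        void* p = buffer_.data() + top_;
        top_ += bytes;
        return p;
    }

    // Shrink the stack. Anything that is not on top stays "buried"
    // as a gap until the allocations above it are freed too.
    void deallocate(void* p, std::size_t bytes) {
        if (static_cast<std::byte*>(p) + bytes == buffer_.data() + top_) {
            top_ -= bytes;
        }
    }

    std::size_t used() const { return top_; }

private:
    std::vector<std::byte> buffer_;
    std::size_t top_;
};
```

Note how freeing out of LIFO order simply leaves the top pointer where it is; that is exactly the "gaps that waste memory" cost described above.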
My algorithm uses two stacks, one at the bottom of the page allocating
up, and one at the top of the page allocating down. Both stacks allocate
toward the center, so they make full use of the buffer, but allocations
alternate between the two stacks. During a resize, if the new buffer
happens to land on the other stack, and the first stack has nothing
above the old buffer, then the old buffer is freed as soon as the
container copies its contents out of it. This is likely to help, for
example, a
single container that resizes many times in a row.
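The core of the double-stack idea can be sketched like this. The names are hypothetical stand-ins (the real implementation is in the linked GitHub repository), and deallocation and alignment are left out to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the double-stack page: one stack grows up from the bottom,
// the other grows down from the top, and successive allocations
// alternate between them until the two stacks would collide.
class DoubleStackPage {
public:
    explicit DoubleStackPage(std::size_t bytes)
        : buffer_(bytes), low_(0), high_(bytes), use_low_(true) {}

    void* allocate(std::size_t bytes) {
        if (low_ + bytes > high_) return nullptr;  // stacks would collide
        void* p;
        if (use_low_) {
            p = buffer_.data() + low_;  // bottom stack grows upward
            low_ += bytes;
        } else {
            high_ -= bytes;             // top stack grows downward
            p = buffer_.data() + high_;
        }
        use_low_ = !use_low_;  // next allocation uses the other stack
        return p;
    }

    std::size_t bytes_free() const { return high_ - low_; }

private:
    std::vector<std::byte> buffer_;
    std::size_t low_, high_;
    bool use_low_;
};
```

Because a resizing container's old and new buffers tend to land on opposite stacks, the old buffer often sits on top of its stack and can be reclaimed immediately.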
It has the following properties:
* The stacks can expand indefinitely by adding extra pages to the vector.
* It can be as tightly packed as the configurable alignment allows.
(Demo is set to 64 bytes.)
* Some allocations can cleanly deallocate early.
* Each system is per-thread (thread_local) so that thread pool tasks
clean up on completion.
* Any types can be stored together in the page.
Godbolt: https://godbolt.org/z/znq1j3
GitHub: https://github.com/ScottMichaud/MiscCpp_DoubleStackAllocator
One benchmark used a Mersenne Twister to push random numbers onto a
vector. The elements of that vector are then pushed onto two other
vectors, one for even numbers and one for odd numbers. Each of these
benchmarks was run on a 16-core, 32-thread AMD processor running
Windows 10.
* GCC 10.2 on MSYS2 took ~30% more time than std::allocator (one
thread, release build).
o std::allocator had better start to finish performance by ~1.25x.
* MSVC 2019 took ~40% less time than std::allocator (one thread,
release build).
o DoubleStackAllocator had better start to finish performance by
~1.65x.
* MSVC 2019 took ~60% less time than std::allocator when running 15
threads (release build).
o DoubleStackAllocator had better start to finish performance by
~2.5x.
Imgur of profiler results: https://imgur.com/a/E5h3OdH
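For reference, the benchmark workload reads roughly like this sketch. The function name, element count, and seed are illustrative only; the real benchmark swaps std::allocator for the DoubleStackAllocator on all three vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch of the workload: a Mersenne Twister pushes random numbers
// onto a vector, then each element is pushed onto one of two other
// vectors (evens and odds). Returns the sizes of the two vectors.
std::pair<std::size_t, std::size_t> split_even_odd(std::size_t count) {
    std::mt19937 rng(12345);  // Mersenne Twister with a fixed seed
    std::vector<std::mt19937::result_type> numbers;
    numbers.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        numbers.push_back(rng());
    }

    std::vector<std::mt19937::result_type> evens, odds;
    for (auto n : numbers) {
        (n % 2 == 0 ? evens : odds).push_back(n);
    }
    return {evens.size(), odds.size()};
}
```

The even/odd split is deliberately unpredictable, so the two destination vectors resize at irregular times, which is the case the double-stack layout is meant to help.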
The actual proposal has nothing to do with any specific algorithm. This
just shows that solutions can optimize better when they know more about
how the allocation will be used (even though I was slower in one case).
The platform has no idea what I am doing if I just use std::allocator.
*[=Request for Feedback=]*
As mentioned, the root problem seems to be a lack of communication
between the programmer and the implementation. What I would like to see
is a series of custom allocators named after specific use cases. Do not
burden implementations with a specific algorithm; rather, allow them to
decide how to handle each case.
Going back to the LIFO, per-thread example:
* Some vendors might want to use the double-stack allocator.
* Some vendors might want to use a single-stack allocator.
* Some vendors might want to bypass RAM altogether for the first
page(s) of each thread.
o Why not let vendors set aside some asymmetric memory per
hardware thread?
* Some vendors may just use std::allocator (and that’s fine).
* Embedded device vendors might have a different idea altogether.
The proposal is not about algorithms. The proposal is about *defining a
set of common usage patterns*, and *naming allocators according to those
patterns*. I believe that targeting goals (like a hypothetical
std::temporary_allocator, etc.) matches what the application and
platform want. This is especially true on the education front. “Use the
temporary allocator when you want to allocate a temporary” sounds much
easier to learn and teach than “Use a single-stack allocator when you
want to allocate a temporary”.
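To make the call-site ergonomics concrete, here is how a goal-named allocator might read. std::temporary_allocator does not exist today; this stand-in just forwards to std::allocator (the "some vendors may just use std::allocator" case above), and a vendor could swap in a stack-based scheme without changing any call sites.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical goal-named allocator. The name communicates intent
// ("this is a temporary"); the algorithm behind it is the vendor's
// choice. Here it is simply std::allocator under another name.
template <typename T>
struct temporary_allocator : std::allocator<T> {
    temporary_allocator() = default;
    template <typename U>
    temporary_allocator(const temporary_allocator<U>&) {}
};

// The call site states the usage pattern, not the algorithm.
int sum_of_squares(int n) {
    std::vector<int, temporary_allocator<int>> scratch;
    for (int i = 1; i <= n; ++i) {
        scratch.push_back(i * i);
    }
    int total = 0;
    for (int v : scratch) total += v;
    return total;
}
```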
It also allows vendors to change their implementation over time.
* What usage patterns would the STL want allocators for?
o Short-term LIFO?
o Some duration of FIFO?
o When lots of allocations are all the same size and alignment?
o Alignments in general?
o Debug allocators to help tooling?
o Automatic zeroing on deallocation for security reasons?
* What would constitute a good name for a goal-based allocator?
* Do implementors and platform stakeholders have any thoughts?
* Who would dictate things like buffer size and alignment? The
programmer, the platform, or the STL?
* Would programmers even be interested in using standards-defined
allocators?
Thanks for your time and your consideration!
Received on 2020-11-05 08:35:00