Date: Wed, 15 Jan 2025 06:17:14 -0800
On Wednesday 15 January 2025 00:41:27 Pacific Standard Time Tiago Freire wrote:
> > And anyway, what's the point? Shrinkable memory allocations already exist
> > on the heap and freeing/shrinking memory is incredibly fast. Memory
> > allocation on the heap is also very fast, if the allocator is using a
> > per-thread arena like it should. So stack-based allocation should only be
> > used for incredibly simple allocations that can be very fast and are also
> > not very big. Otherwise, just use the heap.
>
> Well yes. It is intended to support lightweight algorithms that benefit from
> variable memory size, where the memory requirements are statistically
> small but could theoretically be much larger, and where dipping into the
> heap (and the trappings associated with it) causes a disproportionate amount
> of burden that could otherwise be avoided.
You can only use the stack when you deterministically know that it's going to
be small (at most 1 MB for general systems). If you don't know that for sure,
use the heap, because trying to support both will just make your code more
complex, likely outweighing the performance benefit of having used the stack.
Here's an example of a valid use of an alloca()-like, indeterminate-size
buffer: the array for sched_getaffinity() on Linux. Your code should try a
small but reasonable size once (1024 bits) and see if the kernel tells you
that you need more; if you do, retry with a bigger stack buffer until you get
the full bit set (see the sketch after this list). This example shows:
* a growing array
* that has a reasonable upper limit
Even at 1 MB, we'd be talking about a system with 2^23 logical processors,
which doesn't exist today. Even then, a system with more than 1024 logical
processors NEEDS to have a fast heap allocator so that malloc() doesn't
become a bottleneck for massively-parallel applications.
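Something like this sketch, assuming glibc (count_affinity_cpus is my name,
not an existing API): grow the buffer by recursing, so each attempt's
alloca() memory is released when its frame returns:

    #include <alloca.h>
    #include <cstddef>
    #include <errno.h>
    #include <sched.h>      // sched_getaffinity(), CPU_COUNT_S() (glibc)

    // Returns the number of CPUs in this thread's affinity mask, or -1.
    // Worst-case stack use is bounded: the live frames' buffers sum to
    // less than twice the final size, and we cap that at 1 MB.
    static int count_affinity_cpus(size_t nbytes = 128 /* 1024 bits */)
    {
        cpu_set_t *mask = static_cast<cpu_set_t *>(alloca(nbytes));
        if (sched_getaffinity(0, nbytes, mask) == 0)
            return CPU_COUNT_S(nbytes, mask);
        if (errno == EINVAL && nbytes < (1 << 20))  // kernel mask is bigger
            return count_affinity_cpus(nbytes * 2); // retry with more stack
        return -1;                                  // some other error
    }

If the first try succeeds, which is the overwhelmingly common case, the heap
is never touched at all.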
> This is a hard requirement. I've always been uncomfortable with the status
> quo "hey, you have this piece of memory that you must use in order for your
> application to even function; it's quite limited in size, so be very
> careful not to go over its limit. And by the way, I'm not going to tell you
> how much that is, and it's really hard to estimate how much you really
> need, and it is almost impossible to recover if you mess it up, good
> luck!"
Good luck getting that, then. The C++ standard is not going to look into
making this a requirement unless it's already been made feasible in the
majority of environments.
> I've realized that it might need some operating system support, if
> that's the case, so be it; let's see how far we can go and how close we can
> get to it. A "low-level C library" alone probably won't do: if it requires
> expensive system calls, that almost completely defeats the purpose, and
> you might as well just create your own "alternative stack" that you manage
> yourself.
Not exactly. All processes are created with a single thread, so the only one
the operating system needs to tell the runtime about is the starting thread.
It wouldn't be difficult to include that information in the auxv or similar: a
simple pointer to where the stack ends. Then the C library simply stores that.
For all other regular threads, the C library either tells the OS what the
stack size will be (e.g., Windows with CreateThread's dwStackSize) or is
responsible for allocating the stack in the first place (Unix systems in
general). So it will either know the end of the stack for sure or can estimate
it reasonably well. As with the main thread, this information can simply be
stored in the TCB, which must work because C++ requires thread_local to work.
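As a sketch of the shape this could take (every name here is hypothetical:
there is no AT_STACK_END auxv entry today, and rt:: stands in for whatever
the C library would actually provide):

    #include <sys/auxv.h>   // getauxval() (glibc)
    #include <cstddef>
    #include <cstdint>

    namespace rt {
      constexpr unsigned long AT_STACK_END = 64;  // hypothetical auxv tag

      // Lives in the TCB; thread_local already has to work in C++.
      thread_local std::uintptr_t stack_end;

      // Main thread: the kernel told us where the stack ends.
      void init_main_thread() { stack_end = getauxval(AT_STACK_END); }

      // Any thread: roughly how much stack remains below this frame.
      std::size_t stack_remaining()
      {
          int marker;   // its address approximates the stack pointer
          return reinterpret_cast<std::uintptr_t>(&marker) - stack_end;
      }
    }

For secondary threads, the C library would set rt::stack_end in the thread
start trampoline, since it either allocated the stack or was told its size.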
For manual context switching (fibres, setcontext(), etc.), the framework in
question needs to be updated. In the case of setcontext(), the stack is a
parameter, so the function can simply update the TCB with the new values. The
runtime should also provide a way to update the values stored so that other
implementations can achieve the same.
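Continuing the hypothetical rt:: sketch above, a fibre library's switch would
keep the stored value current; setcontext()/swapcontext() already carry the
new stack in uc_stack:

    #include <ucontext.h>

    void switch_fibre(ucontext_t *from, ucontext_t *to)
    {
        std::uintptr_t saved = rt::stack_end;
        rt::stack_end =
            reinterpret_cast<std::uintptr_t>(to->uc_stack.ss_sp);
        swapcontext(from, to);   // runs on to's stack until switched back
        rt::stack_end = saved;   // back on the original stack
    }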
The one case where we'll go back to requiring OS support is altstack signals,
because the OS will change the stack but not other thread-specific controls.
But this is also a case where we may not need it at all: the altstack is
usually used for crash handlers, so the code running there already knows the
size of the stack.
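For illustration, the usual crash-handler setup: the application allocated
the alternate stack itself, so its base and size are both known exactly:

    #include <signal.h>

    static char crash_stack[64 * 1024];     // size chosen by the application

    void install_crash_handler(void (*handler)(int))
    {
        stack_t ss = {};
        ss.ss_sp = crash_stack;             // base and size both known here
        ss.ss_size = sizeof(crash_stack);
        sigaltstack(&ss, nullptr);          // tell the OS about it

        struct sigaction sa = {};
        sa.sa_handler = handler;
        sa.sa_flags = SA_ONSTACK;           // deliver on crash_stack
        sigaction(SIGSEGV, &sa, nullptr);
    }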
> A compiler could pre-determine, based on how
> many variables are declared, the scope of what functions get called, and
> what gets returned, how much storage it would need to service the "fixed"
> portion of the function's needs, pre-reserve that, and force the
> "variable" portion to then just happen at the end, allowing it to freely
> expand or contract this region without fragmentation. You may not
> necessarily want to implement all functions this way, but you could
> opt in if you wished to use this feature. But that depends on how
> compilers make functions reserve the stack right now while still being
> able to make use of "alloca". I don't know, I'm not familiar, hence I would
> need help on this, but maybe it already works this way and thus the barrier
> to implementation would be much lower.
You're describing how alloca already works. Maybe you need to describe here
what you want to achieve, not the means by which you'd implement it.
> > Why not? They just modify the stack frame of the function they've been
> > inlined into or the compiler restores the stack when the block is done
> > with.
>
> The problem I was thinking of is that you can only move the stack backwards
> or forwards, or recover a previous state, which typically occurs when
> returning from a function,
From the point of view of the compiler, an inlined function does return too.
At that point, the compiler can choose to either rewind the stack or not. For
example, if this function is called in a loop, I'd expect it to rewind the
stack so it could expand again - that's the case of the sched_getaffinity()
example above. But if it's not called in a loop, the compiler can simply defer
the stack restoration to the outer function's frame.
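A sketch of both cases (assuming grow() does get inlined; real compilers are
conservative about inlining functions that call alloca()):

    #include <alloca.h>
    #include <cstddef>
    #include <cstring>

    inline std::size_t grow(std::size_t n)
    {
        char *buf = static_cast<char *>(alloca(n));  // stack allocation
        std::memset(buf, 0, n);                      // use the buffer
        return n;
    }

    std::size_t loop_case(std::size_t n)
    {
        std::size_t total = 0;
        for (std::size_t i = 1; i <= n; ++i)
            total += grow(i);   // expect a rewind per iteration, or the
                                // inlined allocations would pile up
        return total;
    }

    std::size_t straight_case()
    {
        return grow(512);       // the rewind can be deferred to this
                                // function's own epilogue
    }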
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel DCAI Platform & System Engineering