
Re: Structured dynamic stack allocation

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Wed, 15 Jan 2025 22:46:00 +0000


> You can only use the stack when you deterministically know that it's going to be small (at most 1 MB for general systems). If you don't know that for sure, use the heap, because trying to support both will just make your code more complex and likely outweigh the performance benefit of having used the stack.
> Here's an example of valid use of an alloca()-like indeterminate-size buffer: the array for sched_getaffinity() on Linux. Your code should try a small but reasonable size once (1024 bits) and see if the kernel tells you that you need more; if you do, then use the stack with a bigger value until you get the entire bit set. This example shows:
> * a growing array
> * that has a reasonable upper limit
> Even at 1 MB, we'd be talking about a system with 2^23 logical processors, which doesn't exist today. Even then, a system with more than 1024 logical processors NEEDS to have a fast heap allocator so that malloc() doesn't become a bottleneck for massively-parallel applications.

I have used alloca, for example, to implement a formatting library. The size of the text being formatted is theoretically unbounded (or at least an impractically large number), but in practice a few kilobytes cover pretty much 100% of all practical cases. Because it could be larger, if the library sees that it needs too much memory it goes to the heap instead of blowing the stack.
I've benchmarked it, and I'm doing a lot of crazy things like concatenating strings, formatting doubles, taking timestamps, converting them to a date and time and formatting that, and putting everything together; just changing from alloca to malloc added a whopping 20% to 25% to the execution time. It's doing all sorts of expensive formatting and conversions, but that one call to malloc (and free) adds 25% to the cost of the entire thing.
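
A minimal sketch of the pattern (not my actual library code; the names, the 4 KiB threshold, and the snprintf call stand in for the real formatting work):

#include <alloca.h>   // non-standard, but widely available
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Format into a stack buffer when the estimate is small, fall back to the
// heap otherwise. This is a sketch only; the real library does far more.
void format_message(const char* name, double value, std::size_t estimated_size)
{
    constexpr std::size_t stack_limit = 4 * 1024;   // a few KiB covers nearly every call
    const bool on_heap = estimated_size > stack_limit;
    char* buf;
    if (on_heap)
        buf = static_cast<char*>(std::malloc(estimated_size));  // rare slow path
    else
        buf = static_cast<char*>(alloca(estimated_size));       // common fast path

    std::snprintf(buf, estimated_size, "%s = %g", name, value); // stand-in for the real work
    // ... hand the formatted text off ...

    if (on_heap)
        std::free(buf);
}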

I have also used it to process events in a low-latency event system. The size of the events was not predictable by the system I was building (but the instigator would tell me how much memory was needed); theoretically messages could be large, but because of the physical limitations of the components being modeled, less than 1 KB would ever be needed 99.9% of the time. That can sit comfortably in my stack, and you bet that is where I'm going to put it; there's a backup if-statement to dip into the heap just in case of that 0.1%.
And you can bet that if I knew how much stack was available and how much of it was safe to consume, that percentage would be even lower.

I would say that these are valid use cases: the size limits are unknown in advance and could be large, but statistically aren't; and dipping into the heap is disproportionately costly.

It's awesome, and I would definitely use it more in similar circumstances, had it been more practical to use and had it not become more unsafe the more you use it.


> Good luck getting that, then. The C++ standard is not going to look into making this a requirement unless it's already been made feasible in the majority of environments.

I believe it's worth the research; let's not be defeatist at the beginning of the journey, because that problem may turn out not to be a problem. And if it works, there are some interesting things we can do with it.


> You're describing how alloca already works. Maybe you need to describe here what you want to achieve, not the means by which you'd implement it.

I was just describing what you would need in order to avoid fragmentation and make it practical to resize both up and down, which might require changing how certain implementations manage function-level variables on the stack. The point being: if that is how alloca already works (it has to coexist somehow with the other stuff on the stack), then we don't really need to convince anybody to change anything, because one of the requirements to make it work is already how things work. It immediately becomes much easier to sell.


> Not exactly. All processes are created with a single thread, so the only one the operating system needs to tell the runtime about is the starting thread.
> It wouldn't be difficult to include that information in the auxv or similar: a simple pointer where the stack ends. Then the C library simply stores that.
> For all other regular threads, the C library either tells the OS what the stack size will be (e.g., Windows with CreateThread's dwStackSize) or is responsible for allocating the stack in the first place (Unix systems in general). So it will either know for sure the end of the stack or can estimate it reasonably well. Like for the main thread, this information can simply be stored in the TCB, which must work because C++ requires thread_local to work.

Exactly, and that's why it feels like this information shouldn't be all that difficult to get. It looks achievable; it feels like we should be able to do this.
That's why I want to give this idea a good effort.

> From the point of view of the compiler, an inlined function does return too.
> At that point, the compiler can choose to either rewind the stack or not. For example, if this function is called in a loop, I'd expect it to rewind the stack so it could expand again

Well, this is a problem right now when using alloca. It feels like it should be a compiler defect: when code gets inlined, the compiler does not account for the fact that you might have moved the stack pointer, so it doesn't roll it back; the only way to move that pointer back is by actually returning from the function, not just faking it. You can easily blow up the stack if you are not careful.
gcc has this problem, and it is not considered a defect. That would have to change.
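
A small sketch of the scenario I mean (hypothetical names; assuming the compiler does choose to inline process_one into the loop):

#include <alloca.h>
#include <cstddef>
#include <cstring>

// If process_one() is inlined into the loop below, each iteration's alloca()
// block only goes away when caller() returns, so stack usage keeps growing
// with the iteration count instead of being rewound at each "return".
inline void process_one(const char* src, std::size_t n)
{
    char* tmp = static_cast<char*>(alloca(n));  // reclaimed at function return
    std::memcpy(tmp, src, n);
    // ... work on tmp ...
}

void caller(const char* const* msgs, const std::size_t* sizes, std::size_t count)
{
    for (std::size_t i = 0; i != count; ++i)
        process_one(msgs[i], sizes[i]);
}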


-----Original Message-----
From: Thiago Macieira <thiago_at_macieira.org>
Sent: Wednesday, January 15, 2025 3:17 PM
To: std-discussion_at_[hidden]; Tiago Freire <tmiguelf_at_hotmail.com>
Subject: Re: [std-discussion] Structured dynamic stack allocation

On Wednesday 15 January 2025 00:41:27 Pacific Standard Time Tiago Freire wrote:
> > And anyway, what's the point? Shrinkable memory allocations already
> > exist on the heap and freeing/shrinking memory is incredibly fast.
> > Memory allocation on the heap is also very fast, if the allocator is
> > using a per-thread arena like it should. So stack-based allocation
> > should only be used for incredibly simple allocations that can be
> > very fast and are also not very big. Otherwise, just use the heap.
>
> Well yes. It is intended to support lightweight algorithms that
> benefit from variable memory size but where the memory requirements
> are statistically small but theoretically could be much larger, and
> where dipping into the heap (and the trappings associated with it)
> causes a disproportional amount of burden that could otherwise be avoided.

You can only use the stack when you deterministically know that it's going to be small (at most 1 MB for general systems). If you don't know that for sure, use the heap, because trying to support both will just make your code more complex and likely outweigh the performance benefit of having used the stack.

Here's an example of valid use of an alloca()-like indeterminate-size buffer: the array for sched_getaffinity() on Linux. Your code should try a small but reasonable size once (1024 bits) and see if the kernel tells you that you need more; if you do, then use the stack with a bigger value until you get the entire bit set (see the sketch after this list). This example shows:
* a growing array
* that has a reasonable upper limit
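
A minimal sketch of that retry loop, assuming glibc (count_affinity_cpus is a made-up name; error handling kept minimal):

#define _GNU_SOURCE
#include <sched.h>
#include <alloca.h>
#include <cerrno>
#include <cstddef>

// Start at 1024 bits and grow until the kernel accepts the buffer. Each
// alloca() stays live until the function returns, but because the size
// doubles, total stack use stays within roughly twice the final size.
int count_affinity_cpus()
{
    std::size_t bits = 1024;                       // small but reasonable first guess
    for (;;) {
        std::size_t bytes = bits / 8;
        cpu_set_t* set = static_cast<cpu_set_t*>(alloca(bytes));
        if (sched_getaffinity(0, bytes, set) == 0)
            return CPU_COUNT_S(bytes, set);        // got the full bit set
        if (errno != EINVAL)
            return -1;                             // a real error, give up
        bits *= 2;                                 // kernel's mask is bigger; retry
    }
}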

Even at 1 MB, we'd be talking about a system with 2^23 logical processors, which doesn't exist today. Even then, a system with more than 1024 logical processors NEEDS to have a fast heap allocator so that malloc() doesn't become a bottleneck for massively-parallel applications.
 
> This is a hard requirement. I've always been uncomfortable with the
> status quo "hey, you have this piece of memory that you must use in
> order for your application to even function, it's quite limited in
> size and be very careful not to go over its limit. And by the way, I'm
> not going to tell you how much that is, and it's really hard to
> estimate how much you really need, and it is almost impossible to
> recover if you mess it up, good luck!"

Good luck getting that, then. The C++ standard is not going to look into making this a requirement unless it's already been made feasible in the majority of environments.

> I've realized that it might need some operating system support, if
> that's the case so be it, let's see how far we can go and how far we
> can get to it. "low-level C library" probably not, if it requires
> expensive system calls to do it, it almost completely defeats the
> purpose of it, and you might as well just create your own "alternative
> stack" that you manage yourself.

Not exactly. All processes are created with a single thread, so the only one the operating system needs to tell the runtime about is the starting thread.
It wouldn't be difficult to include that information in the auxv or similar: a simple pointer where the stack ends. Then the C library simply stores that.

For all other regular threads, the C library either tells the OS what the stack size will be (e.g., Windows with CreateThread's dwStackSize) or is responsible for allocating the stack in the first place (Unix systems in general). So it will either know for sure the end of the stack or can estimate it reasonably well. Like for the main thread, this information can simply be stored in the TCB, which must work because C++ requires thread_local to work.
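
For reference, on glibc this information is already retrievable today with non-portable calls; a minimal sketch (the point being that a runtime could just as well record the same bounds in the TCB at thread start-up):

#define _GNU_SOURCE
#include <pthread.h>
#include <cstddef>

// Query the calling thread's stack base (lowest address) and size.
// pthread_getattr_np() is a glibc extension, not a standard facility.
bool current_thread_stack(void** base, std::size_t* size)
{
    pthread_attr_t attr;
    if (pthread_getattr_np(pthread_self(), &attr) != 0)
        return false;
    const int rc = pthread_attr_getstack(&attr, base, size);
    pthread_attr_destroy(&attr);
    return rc == 0;
}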

For manual context switching (fibres, setcontext(), etc.), the framework in question needs to be updated. In the case of setcontext(), the stack is a parameter, so the function can simply update the TCB with the new values. The runtime should also provide a way to update the values stored so that other implementations can achieve the same.
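
A rough sketch of what such a framework update could look like (Linux ucontext API; update_stack_bounds_in_tcb() is purely hypothetical, no such runtime hook exists today):

#include <ucontext.h>
#include <cstddef>
#include <cstdlib>

// Hypothetical hook: a runtime-provided way to publish new stack bounds.
static void update_stack_bounds_in_tcb(void* /*base*/, std::size_t /*size*/) {}

static ucontext_t main_ctx, fiber_ctx;

void run_fiber(void (*fn)(), std::size_t stack_size)
{
    void* stack = std::malloc(stack_size);
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;        // the framework chooses the stack,
    fiber_ctx.uc_stack.ss_size = stack_size; // so it already knows the bounds
    fiber_ctx.uc_link = &main_ctx;           // return here when fn() finishes
    makecontext(&fiber_ctx, fn, 0);

    update_stack_bounds_in_tcb(stack, stack_size);
    swapcontext(&main_ctx, &fiber_ctx);      // run the fiber to completion
    // (a real framework would restore the original thread's bounds here)
    std::free(stack);
}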

The one case where we'll go back to requiring OS support is altstack signals, because the OS will change the stack but not other thread-specific controls.
But this is also a case where we may not need it in the first place: the altstack is usually used for crash handlers, so the code running there already knows the size of the stack.

> A compiler could pre-determine, based on how many variables are
> declared and the scope of what functions get called and what gets
> returned, how much storage it would need to service the "fixed"
> portion of the function's needs, pre-reserve that, and have the
> "variable" portion then just happen at the end, allowing it to freely
> expand or contract without fragmentation. You may not
> necessarily want to implement all functions in this way, but you could
> opt in if you wished to use this feature. But that depends on how
> compilers make functions reserve the stack right now and still are
> able to make use of "alloca". I don't know, I'm not familiar, hence I
> would need help on this, but maybe it already works this way and thus
> the barrier for implementation would be much lower.

You're describing how alloca already works. Maybe you need to describe here what you want to achieve, not the means by which you'd implement it.

> > Why not? They just modify the stack frame of the function they've
> > been inlined into or the compiler restores the stack when the block
> > is done with.
>
> The problem I was thinking is that you can only move the stack back or
> forwards, or recover a previous state which typically occurs when
> returning from a function,

From the point of view of the compiler, an inlined function does return too.
At that point, the compiler can choose to either rewind the stack or not. For example, if this function is called in a loop, I'd expect it to rewind the stack so it could expand again - that's the case of the sched_getaffinity() example above. But if it's not called in a loop, the compiler can simply defer the stack restoration to the outer function's frame.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel DCAI Platform & System Engineering



Received on 2025-01-15 22:46:07