Date: Wed, 15 Jan 2025 16:26:45 -0800
On Wednesday 15 January 2025 14:46:00 Pacific Standard Time Tiago Freire wrote:
> I have used alloca for example to implement a formatting library , the size
> of text being formatted is theoretically unbounded (or at least and
> impractical large number) in practice a few kilobytes will cover pretty
> much 100% of all practical cases. But because it could be larger, if it
> sees that it needs to much memory it goes to the heap instead of blowing
> the stack.
And because it's unbounded, you must have code to switch to the heap instead.
At that point, is it still faster having used the stack, rather than going
straight for the heap?
And do you need dynamic stack arrays? For example, we have QVarLengthArray in
Qt, which is a *static* sized stack array and it can grow to use the heap
instead.
> I've benchmarked it, and I'm doing a lot of crazy things like
> concatenating string, formatting doubles, taking timestamps converting them
> to a date and time and formatting that, putting everything together; just
> changing from using alloca to malloc added a whopping 20% to 25% increase
> in execution time. It's doing all sorts of expensive formatting and
> conversions, but that 1 call to malloc (and free) adds 25% to the cost of
> the entire thing.
On Linux? My experience is that malloc runs in about 20 cycles. That's still
far more than a stack allocation (< 1 cycle), but in absolute terms it isn't a
lot. So I'm surprised it's a large portion of your total. Formatting doubles
is going to be hundreds of cycles, for example.
> I have also used it to process low-latency event system, the size of the
> events was not predictable by the system I was building (but the instigator
> would tell me how much memory was needed), theoretically messages could be
> large but because of the physical limitations of the components being
> modeled less than 1Kb would ever be needed 99.9% of the time. That can sit
> comfortably in my stack, and you bet that is where I'm going to put it,
> there's backup if statement to dip into the heap just in case of that 0.1%
That's the usecase of QVarLengthArray. Implementing it does not require any of
what you've requested, seeing as we've had it for 20 years. For example, we
use it in the parsing numbers stored in QString: as std::from_chars only
operates on char, we need to convert from Locale+UTF-16 to 8-bit C locale.
Most numbers are fairly short, but we can't rule out a long sequence of
leading zeroes, so we need something unbounded.
> And you could bet if I knew how much stack was available and how much it
> was safe to consume, that percentage would be even lower.
> I would say that these are valid use cases, size limits are unknown in
> advance, could be large, but statistically isn't; and dipping to the heap
> is disproportionally costly.
I don't think so. I'm skeptical about your benchmarking, but I also think that
you don't need the dynamic information in the first place. Knowing what type of
system you're running on, you can estimate what's a reasonable stack usage,
without needing to have been told the total.
Querying such value and making runtime decisions on which path to take will
introduce tens of cycles of latency in your code. CPUs (at least Intel ones)
aren't optimised for decision-making based on the RSP register's value.
You can benchmark this by trying to find the bottom of the stack using
/proc/self/maps (the main thread's is marked [stack]) and a bit of inline
assembly.
> It's awesome, I would definitely use it more in similar circumstances, had
> it been more practical to use and had it not been more unsafe the more you
> use it.
You still need to guesstimate how much of the stack the functions you call
will need. Suppose the current value says there are 7 MB of stack free - how
much can you use of that? 3.5? 6.5? 6.9375? What happens if the application
receives a signal at the worst possible time? And what happens if the SIGCHLD
handler that some library installed is poorly written and uses a lot of stack?
> > Good luck getting that, then. The C++ standard is not going to look into
> > making this a requirement unless it's already been made feasible in the
> > majority of environments.
>
> I believe it's worth the research, let's not be defeatist at the beginning
> of the journey because that problem may turn out not to be a problem. And
> if it works there's some interesting things we can do with it.
I'm not being defeatist. I am skeptical of the benefit, though.
And I am saying you'll need a proof of concept first and foremost, then pave
the way by implementing the runtime requirements it in the majority of OSes,
before the standard will take it up.
> > You're describing how alloca already works. Maybe you need to describe
> > here what you want to achieve, not the means by which you'd implement
> > it.
>
> I was just describing what you needed in order to avoid fragmentation and
> make it practical to resize both up and down. Which might need to change
> how certain implementations manages function level variables on the stack
> in order to make it work. Point being if that is how alloca already works
> (it got to work somehow with other stuff on the stack) then we don't really
> need to convince anybody to change anything because one of the requirements
> to make it work already is how it works. It immediately becomes much easier
> to sell.
Again, the compiler already knows how to do this. The mechanics of
"realloca()" used in a function are really no different from a repeated call to
a function using alloca().
> Exactly, that's why it feels that this information shouldn't be all that
> difficult to get. It looks achievable, it feels like we should be able to
> do this.
> That's why I want to give this idea a good effort.
No, it shouldn't be difficult to implement this. But someone has to do that and
it looks like that person is you.
> Well, this is a problem right now when using alloca. It feels like it should
> be a compiler defect, but when code gets inlined it does not account for
> the fact that you might have moved the stack pointer so it doesn't roll it
> back, the only available way to move back that pointer is by returning from
> the function and not just faking it. You can easily blow up the stack if
> you are not careful. gcc has this problem, and it is not considered a
> defect. That would have to change.
I didn't follow this portion.
> I have used alloca for example to implement a formatting library , the size
> of text being formatted is theoretically unbounded (or at least and
> impractical large number) in practice a few kilobytes will cover pretty
> much 100% of all practical cases. But because it could be larger, if it
> sees that it needs to much memory it goes to the heap instead of blowing
> the stack.
And because it's unbounded, you must have code to switch to the heap instead.
At that point, is it still faster having used the stack, rather than going
straight for the heap?
And do you need dynamic stack arrays? For example, we have QVarLengthArray in
Qt, which is a *static* sized stack array and it can grow to use the heap
instead.
> I've benchmarked it, and I'm doing a lot of crazy things like
> concatenating string, formatting doubles, taking timestamps converting them
> to a date and time and formatting that, putting everything together; just
> changing from using alloca to malloc added a whopping 20% to 25% increase
> in execution time. It's doing all sorts of expensive formatting and
> conversions, but that 1 call to malloc (and free) adds 25% to the cost of
> the entire thing.
On Linux? My experience is that malloc runs in about 20 cycles. That's still
far more than a stack allocation (< 1 cycle), but in absolute terms it isn't a
lot. So I'm surprised it's a large portion of your total. Formatting doubles
is going to be hundreds of cycles, for example.
> I have also used it to process low-latency event system, the size of the
> events was not predictable by the system I was building (but the instigator
> would tell me how much memory was needed), theoretically messages could be
> large but because of the physical limitations of the components being
> modeled less than 1Kb would ever be needed 99.9% of the time. That can sit
> comfortably in my stack, and you bet that is where I'm going to put it,
> there's backup if statement to dip into the heap just in case of that 0.1%
That's the usecase of QVarLengthArray. Implementing it does not require any of
what you've requested, seeing as we've had it for 20 years. For example, we
use it in the parsing numbers stored in QString: as std::from_chars only
operates on char, we need to convert from Locale+UTF-16 to 8-bit C locale.
Most numbers are fairly short, but we can't rule out a long sequence of
leading zeroes, so we need something unbounded.
> And you could bet if I knew how much stack was available and how much it
> was safe to consume, that percentage would be even lower.
> I would say that these are valid use cases, size limits are unknown in
> advance, could be large, but statistically isn't; and dipping to the heap
> is disproportionally costly.
I don't think so. I'm skeptical about your benchmarking, but I also think that
you don't need the dynamic information in the first place. Knowing what type of
system you're running on, you can estimate what's a reasonable stack usage,
without needing to have been told the total.
Querying such value and making runtime decisions on which path to take will
introduce tens of cycles of latency in your code. CPUs (at least Intel ones)
aren't optimised for decision-making based on the RSP register's value.
You can benchmark this by trying to find the bottom of the stack using
/proc/self/maps (the main thread's is marked [stack]) and a bit of inline
assembly.
> It's awesome, I would definitely use it more in similar circumstances, had
> it been more practical to use and had it not been more unsafe the more you
> use it.
You still need to guesstimate how much of the stack the functions you call
will need. Suppose the current value says there are 7 MB of stack free - how
much can you use of that? 3.5? 6.5? 6.9375? What happens if the application
receives a signal at the worst possible time? And what happens if the SIGCHLD
handler that some library installed is poorly written and uses a lot of stack?
> > Good luck getting that, then. The C++ standard is not going to look into
> > making this a requirement unless it's already been made feasible in the
> > majority of environments.
>
> I believe it's worth the research, let's not be defeatist at the beginning
> of the journey because that problem may turn out not to be a problem. And
> if it works there's some interesting things we can do with it.
I'm not being defeatist. I am skeptical of the benefit, though.
And I am saying you'll need a proof of concept first and foremost, then pave
the way by implementing the runtime requirements it in the majority of OSes,
before the standard will take it up.
> > You're describing how alloca already works. Maybe you need to describe
> > here what you want to achieve, not the means by which you'd implement
> > it.
>
> I was just describing what you needed in order to avoid fragmentation and
> make it practical to resize both up and down. Which might need to change
> how certain implementations manages function level variables on the stack
> in order to make it work. Point being if that is how alloca already works
> (it got to work somehow with other stuff on the stack) then we don't really
> need to convince anybody to change anything because one of the requirements
> to make it work already is how it works. It immediately becomes much easier
> to sell.
Again, the compiler already knows how to do this. The mechanics of
"realloca()" used in a function are really no different from a repeated call to
a function using alloca().
> Exactly, that's why it feels that this information shouldn't be all that
> difficult to get. It looks achievable, it feels like we should be able to
> do this.
> That's why I want to give this idea a good effort.
No, it shouldn't be difficult to implement this. But someone has to do that and
it looks like that person is you.
> Well, this is a problem right now when using alloca. It feels like it should
> be a compiler defect, but when code gets inlined it does not account for
> the fact that you might have moved the stack pointer so it doesn't roll it
> back, the only available way to move back that pointer is by returning from
> the function and not just faking it. You can easily blow up the stack if
> you are not careful. gcc has this problem, and it is not considered a
> defect. That would have to change.
I didn't follow this portion.
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Principal Engineer - Intel DCAI Platform & System Engineering
Received on 2025-01-16 00:26:51