Date: Fri, 17 Jan 2025 07:50:15 -0800
On Thursday 16 January 2025 23:26:03 Pacific Standard Time Tiago Freire wrote:
> > You still need to guesstimate how much of the stack the functions you call
> > will need. Suppose the current value says there are 7 MB of stack free -
> > how much can you use of that? 3.5? 6.5? 6.9375? What happens if the
> > application receives a signal at the worst possible time? And what
> > happens if the SIGCHLD handler that some library installed is poorly
> > written and uses a lot of stack?
> This is a problem that hasn't gone unnoticed.
> Do developers in general even know right now how much stack is needed to run
> their own applications? Other than on a very rare occasion, their large
> application overflows the stack and they just go back to their code and
> increase that ceiling hoping it is going to be enough.
In general, the stack is big enough for any non-recursive algorithm, so long
as you don't alloca() or VLA a size you don't control. The most common case of
stack overflow remains that of recursive algorithms. We recently had one such
case in Qt, when someone was trying to serialise a very-deeply-nested XML
document using QDomDocument and, for some reason we haven't bothered to
investigate, MSVC was generating very large stack frames. We had to rewrite
the algorithm to use the heap instead of the stack.
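A minimal sketch of that kind of rewrite, assuming a toy tree type rather than QDomDocument: the names Node, Tree, serialise and makeChain are hypothetical and not taken from Qt. The recursion is replaced by an explicit std::vector of pending frames, so nesting depth is limited by available heap rather than by the thread's stack.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy element type; this is NOT the QDomDocument API.
struct Node {
    std::string name;
    std::vector<std::size_t> children;   // indices into the owning Tree
};
using Tree = std::vector<Node>;          // element 0 is the root

// Serialise without recursion: pending work is kept in a std::vector,
// which lives on the heap, instead of in one stack frame per level.
std::string serialise(const Tree &tree)
{
    struct Frame {
        std::size_t node;        // index of the element being written
        std::size_t nextChild;   // position of the next child to visit
    };

    std::string out;
    if (tree.empty())
        return out;

    std::vector<Frame> pending;
    out += '<' + tree[0].name + '>';
    pending.push_back({0, 0});

    while (!pending.empty()) {
        Frame &top = pending.back();
        const Node &node = tree[top.node];
        if (top.nextChild < node.children.size()) {
            const std::size_t child = node.children[top.nextChild++];
            out += '<' + tree[child].name + '>';
            pending.push_back({child, 0});   // invalidates 'top'; it is not used again
        } else {
            out += "</" + node.name + '>';
            pending.pop_back();
        }
    }
    return out;
}

// Build a chain nested 'depth' levels deep, to exercise deep nesting.
Tree makeChain(std::size_t depth)
{
    Tree tree(depth);
    for (std::size_t i = 0; i < depth; ++i) {
        tree[i].name = "n";
        if (i + 1 < depth)
            tree[i].children.push_back(i + 1);
    }
    return tree;
}
```

With this shape, serialise(makeChain(2)) yields "<n><n></n></n>", and the per-level cost is one small Frame on the heap regardless of what stack frame size the compiler would have chosen for a recursive version.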
Knowing the size of the stack beforehand wouldn't have helped because however
big it is, it's always going to be smaller than the available heap, so it
would still impose a lower limit than the heap does. It also depends on the
whims of the compiler in creating the stack frame, whereas the heap overhead
is usually much more tightly controlled.
And using a hybrid approach wasn't acceptable, because then we'd need to keep
two sets of non-trivial serialisation algorithms, each with their maintenance
requirements and possible subtle incompatibilities. It would make for very
difficult unit-testing too. This is what I meant when I said that if you don't
have a reasonable upper bound in the size, you need to use the heap anyway.
> > I didn't follow this portion.
>
> It's a known issue when using alloca that if the function that uses alloca
> is inlined, the supposed caller of the function does not get their stack
> rewound. And if "invoked" multiple times (like in a loop) the amount of
> used stack will increasingly increase until it overflows. And this is not
> considered a bug. You must mark functions that used alloca as not
> inlineable.
> I think clang handles this better, but it is a known issue if you are writing
> a library.
I don't see what that has to do with a library.
This is a limitation in the compilers, that they won't inline an alloca()-
using function (if the size is dynamic, at least). The sched_getaffinity()
example shows it: https://gcc.godbolt.org/z/Kd1vcMjGd. Neither Clang nor GCC
nor MSVC inlines the looped call, though Clang did inline the first call, which
is outside the loop and uses a fixed size. But it didn't unwind the stack before
entering the loop - not that a mere 256 bytes would make much of a difference
anyway.
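For readers who cannot open the Godbolt link, the pattern under discussion looks roughly like the sketch below. It is not the code behind the link: tryCpuCount and cpuCount are hypothetical names, the starting size and the 64 KiB cap are arbitrary assumptions, and it is specific to Linux with glibc. The helper is explicitly marked noinline so that each alloca() block is released when the helper returns, whatever the compiler's inlining heuristics decide.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for sched_getaffinity() and CPU_COUNT_S()
#endif
#include <alloca.h>
#include <sched.h>
#include <cstddef>

// Hypothetical helper: one attempt with a caller-chosen mask size.
// Because it is never inlined, the alloca() memory is freed on every
// return, so the loop in cpuCount() cannot accumulate stack usage.
__attribute__((noinline))
int tryCpuCount(std::size_t bytes)
{
    auto *mask = static_cast<cpu_set_t *>(alloca(bytes));
    if (sched_getaffinity(0, bytes, mask) != 0)
        return -1;   // typically EINVAL: mask too small for this kernel
    return CPU_COUNT_S(bytes, mask);
}

// Grow the mask until the kernel accepts it. The sizes stay multiples
// of sizeof(unsigned long), as the affinity interface expects.
int cpuCount()
{
    for (std::size_t bytes = sizeof(unsigned long); bytes <= 64 * 1024; bytes *= 2) {
        const int n = tryCpuCount(bytes);
        if (n > 0)
            return n;
    }
    return -1;       // gave up; the cap above is an assumption
}
```

If tryCpuCount() were inlined into the loop without the stack pointer being restored per iteration, each pass would leave its buffer allocated until cpuCount() returned, which is the growth described in the quoted text.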
Note how the old ICC did inline (see the block starting at B1.8). If I switch
to a VLA, now GCC does inline too: https://gcc.godbolt.org/z/e9z1efn46. This
shows there is no theoretical limitation to inlining even in a loop, only
missed optimisations.
How important is this? I would still put it at a low priority. The fact that
you're growing the memory instead of just starting with a reasonably big value
that would almost always work indicates there's a cost associated with the
bigger buffer that you'd rather not pay (indeed, in this case, the buffer must
be memset - see [1]). That means the cost of the function call is going to be
lost amid the noise of whatever other overhead you have. Not to mention you're
placing a call or two anyway. And in this case, we also have a transition to
kernel mode, with all the involved state-clearing to avoid side-channel
attacks.
Previously, a function call would also disable running looped code from the
decode micro-op cache in Intel processors (the Loop Stream Detector). That's
no longer the case for current processors, but I don't remember how long it's
been (definitely less than 4 years).
[1] https://codebrowser.dev/glibc/glibc/sysdeps/unix/sysv/linux/sched_getaffinity.c.html
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Principal Engineer - Intel DCAI Platform & System Engineering
Received on 2025-01-17 15:50:23