I'm pretty sure about my non-trivial vs. trivial numbers. Here's an older set of runs that I had in a draft version of the paper:
You can see that the general shape of the graph is the same, but the means and medians are slightly different.
Take a look at the difference in assembly between trivial and non-trivial with the version of gcc I used:
The general idea is that a local destructor really makes keeping things in registers miserable. With a trivial type, the return value comes back in registers, but you immediately need to move it somewhere else because the destructor you are about to call would clobber those registers. Then you have to move it back so that you can return it. If you have a non-trivial object, it gets RVO'd on the stack, and you can just pass that pointer along, so there's very little shuffling of registers or data. Lots of code will need to read the top of the stack to check whether it holds an error, but that value is going to be super hot in the cache.
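To make the shape of the code concrete, here is a minimal sketch of the two cases. It is not the benchmark code; every name in it (`TrivialResult`, `NonTrivialResult`, `Guard`, and the functions) is hypothetical. It assumes an ABI such as the Itanium C++ ABI on x86-64, where a small trivial struct is returned in registers and a type with a non-trivial destructor is returned through a hidden pointer to caller-provided memory.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical result type: trivial, small enough to come back in registers
// on the assumed ABI.
struct TrivialResult {
    std::int32_t error;  // 0 means success
    std::int32_t value;
};

// Same payload, but the user-provided destructor makes the type non-trivial,
// so on the assumed ABI it is returned through a hidden pointer.
struct NonTrivialResult {
    std::int32_t error;
    std::int32_t value;
    NonTrivialResult(std::int32_t e, std::int32_t v) : error(e), value(v) {}
    NonTrivialResult(const NonTrivialResult& o) : error(o.error), value(o.value) {}
    ~NonTrivialResult() {}
};

static_assert(std::is_trivially_copyable<TrivialResult>::value,
              "TrivialResult should be trivially copyable");
static_assert(!std::is_trivially_destructible<NonTrivialResult>::value,
              "NonTrivialResult should not be trivially destructible");

static int g_guards_destroyed = 0;

// Stand-in for a local object with a destructor. It is defined inline here so
// the sketch is self-contained; to see the register shuffling in assembly, the
// destructor would need to be opaque to the optimizer (for example, defined in
// another translation unit).
struct Guard {
    ~Guard() { ++g_guards_destroyed; }
};

TrivialResult callee_trivial(std::int32_t x) { return {0, x + 1}; }

NonTrivialResult callee_nontrivial(std::int32_t x) {
    return NonTrivialResult(0, x + 1);
}

// If ~Guard() is a real call, the result returned in registers has to be
// saved before that call and restored afterwards.
TrivialResult caller_trivial(std::int32_t x) {
    Guard g;
    return callee_trivial(x);
}

// Here the result is constructed in memory provided by the caller, so the
// destructor call does not disturb it.
NonTrivialResult caller_nontrivial(std::int32_t x) {
    Guard g;
    return callee_nontrivial(x);
}

// Small self-check: both paths compute the same value.
inline void demo() {
    assert(caller_trivial(1).value == 2);
    assert(caller_nontrivial(1).value == 2);
}
```

Compiling something like this with `g++ -O2 -S` and comparing `caller_trivial` with `caller_nontrivial` should show the difference, though the exact output depends on the compiler version and the ABI.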
FWIW, I finally managed to take a look at this. Even with the above in mind, the result is still surprising:
- The stack can be as hot as the sun, but it is still in L1 at best, which is 4-5 cycles of latency on x86 and ARM. (True, x86 can perhaps circumvent this with store-to-load forwarding in this simple scenario; I don't know whether ARM has this feature.) The trivial/in-register version, by contrast, involves only one extra register-to-register move.
- On MSVC (due to its broken x64 ABI) both versions go through the stack, yet the difference is still there (albeit smaller).
I'd guess that (despite the really laudable effort you put into this!) coming up with a 'real world' happy-path benchmark is really difficult (unlike the sad path, which pretty much gives us the expected results)...