ISOCPP std-proposals List: Re: [std-proposals] Efficient and silent bounds checking with silent

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 05 Jul 2023 21:37:15 -0700

On Wednesday, 5 July 2023 20:23:51 PDT unlvsur unlvsur via Std-Proposals
wrote:
> Throwing EH requires additional register allocation and EH runtime calls:
> When an exception is thrown, the EH mechanism needs to allocate additional
> registers to handle EH-related data and make calls to EH runtime functions,
> such as __cxa_throw. These additional operations introduce overhead that
> can impact performance.

That's a completely irrelevant point.

If you're willing to allow the application to crash if the precondition wasn't
satisfied, it implies you only care about performance when it *was* satisfied.

As I said before, crashing an application triggers core dumping, which
launches systemd-coredump to capture the core dump and then compress it with
xz (a very slow compressor). I've experienced severe system slowdowns under
those conditions, with xz taking tens of seconds at 100% of all my cores to
finish its job.

> The EH path hampers optimizations and creates code bloat:

"code bloat" is a subjective quality statement. Yes, there's a need to emit
some code somewhere to throw the exception, but said code need not
meaningfully or even measurably impact your performance. In fact, your own
examples prove my point:

> silent_at:
> https://godbolt.org/z/vrjY4evno
> EH:
> https://godbolt.org/z/qbTPo19df
>
> You can see EH version generates 8x more instructions than unchecked
> version. 3x more compare to silent_at.
>
> In these examples, you can observe the significant increase in instructions
> when EH is employed for bounds checking. This increase can lead to
> performance degradation and hinder optimization efforts.

I guess you haven't actually benchmarked your own code to prove the point
you're trying to make yet, because you'll find out that this increase -- at
least in this example -- will not lead to performance degradation or hinder
optimisation efforts.

That's because the grand total number of instructions is a bad metric. What
matters for performance is the main portion of the function (the non ".cold"
portion) and that's effectively identical. Both hot paths are exactly 12 uops
(most of which will be dispatched in parallel), not including the function
call and return sequence.

If you stop filtering the directives for your output, you'll see:
        .section .text.unlikely
        .cfi_startproc
        .type _Z3foo6myspanImE.cold, @function
_Z3foo6myspanImE.cold:

This means that the contents of the ".cold" portion were moved off to a
different section of the binary, so there shouldn't even be an impact to the
instruction cache hit/miss ratio. The binary itself is larger, but the page(s)
with the exception-throwing code may not even be faulted in from disk in a
regular run of the application, meaning you're not even adding to memory
usage.
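
To make the hot/cold split concrete, here is a minimal sketch. It is not the
code behind the Compiler Explorer links: this myspan, its fail() helper and
the body of foo are hypothetical stand-ins, and the gnu::cold / gnu::noinline
attributes are GCC/Clang extensions assumed here to keep the throwing code out
of line. With GCC, such code typically ends up in .text.unlikely, leaving only
a compare and a rarely-taken branch in the hot path.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the myspan type used in the examples.
template <typename T>
struct myspan {
    T* ptr;
    std::size_t len;

    // Failure path: never returns, marked cold so the compiler keeps it
    // away from the hot code (GCC/Clang-specific attributes).
    [[noreturn, gnu::cold, gnu::noinline]] static void fail() {
        throw std::out_of_range("myspan::at");
    }

    // Hot path: one comparison plus a branch that is normally not taken.
    T& at(std::size_t i) const {
        if (i >= len) fail();
        return ptr[i];
    }
};

// Hypothetical foo: sums the first four elements with checked access.
unsigned long foo(myspan<unsigned long> s) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < 4; ++i) sum += s.at(i);
    return sum;
}

// Usage: in-range access succeeds, a too-short span takes the cold path.
void demo() {
    unsigned long data[4] = {1, 2, 3, 4};
    assert(foo({data, 4}) == 10);
    bool thrown = false;
    try {
        foo({data, 3});
    } catch (const std::out_of_range&) {
        thrown = true;
    }
    assert(thrown);
}
```

Compiling something like this with optimizations on and directive filtering
off is one way to check where the throwing code lands for a given toolchain.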

I'll grant you that this is a micro-example and thus may not be representative
of real-world use-cases. That's a fair argument to make, but if you want to
make it, then back it up with real-world benchmarks.
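
As a starting point for such a measurement, a minimal timing sketch might look
like the following. Everything in it (function names, buffer size, repetition
count) is an assumption made for illustration, and it is itself only a
micro-benchmark: in a loop this simple the optimizer may be able to prove the
index is in range and drop the check entirely, so a real workload would still
be needed to settle the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Sums the vector with unchecked indexing.
unsigned long sum_unchecked(const std::vector<unsigned long>& v) {
    unsigned long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Same loop, but every access goes through at(), which may throw.
unsigned long sum_checked(const std::vector<unsigned long>& v) {
    unsigned long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v.at(i);
    return s;
}

// Calls f(v) reps times; returns elapsed wall time in nanoseconds and
// stores the accumulated result in out so the work is not discarded.
template <typename F>
long long time_ns(F f, const std::vector<unsigned long>& v, int reps,
                  unsigned long& out) {
    auto t0 = std::chrono::steady_clock::now();
    unsigned long acc = 0;
    for (int r = 0; r < reps; ++r) acc += f(v);
    auto t1 = std::chrono::steady_clock::now();
    out = acc;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
        .count();
}

// Times both variants on the same data and prints the two durations.
void run_benchmark() {
    std::vector<unsigned long> v(1 << 16, 1);
    unsigned long a = 0, b = 0;
    long long tu = time_ns(sum_unchecked, v, 1000, a);
    long long tc = time_ns(sum_checked, v, 1000, b);
    assert(a == b);  // both variants must compute the same sum
    std::printf("unchecked: %lld ns, checked: %lld ns\n", tu, tc);
}
```

The numbers depend on compiler, flags and hardware, so they are only
meaningful when compared on the same machine and build configuration.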

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DCAI Cloud Engineering

Received on 2023-07-06 04:37:17