Hello,
I would like to provide better benchmarking numbers for the effect of converting all P2900 contracts to assume statements. The executive summary is that I had mixed results with Eigen, and the regressions could arguably be called a bug in clang's optimizer.
This is not a proposal to modify P2900. P2900 explicitly lists this as something it does not propose, in section 2.3, Features Not Proposed: The ability to assume that an unchecked contract predicate would evaluate to true and to allow the compiler to optimize based on that assumption, i.e., the assume semantic.
To provide a benchmark, my immediate concern was finding a representative project. Most well known high-performance projects already use __builtin_unreachable() (a.k.a. [[assume(false)]]) and it didn't seem too credible to experiment on something that has already been carefully instrumented that way.
Meanwhile P2646R0, the paper that successfully lobbied for the inclusion of [[assume]], used a synthetic benchmark that is not representative of widespread use across a non-trivial program either.
In the end, I settled on the Eigen math library because it is well known, has no assume statements and has a benchmark. The results were mixed and not too encouraging. Eigen may not have been an ideal test subject either as it has been carefully optimized too. Please consider this a request for comments, as I would prefer a benchmark that would be easily recognized as a good example of this technique being applied to a "normal, well written, performance sensitive application."
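These numbers were acquired by converting all asserts to __builtin_assume() and then using AI to add some more. The command line was: clang++-18 -O3 -march=native. The test machine is an AMD CPU with AVX-512 and AVX2/FMA. Percentages are the change in run time with assume hints enabled, so negative numbers mean the assume build is faster.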
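For readers who want to try the same conversion, here is a minimal sketch of the kind of macro swap described above. The names BENCH_ASSERT, BENCH_USE_ASSUME and sum8 are hypothetical and are not taken from Eigen or from my build. The sketch assumes clang for __builtin_assume() and falls back to __builtin_unreachable() on GCC.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro names, for illustration only.
// With BENCH_USE_ASSUME defined, the predicate becomes an optimizer hint
// and is never checked at run time. Without it, it is a normal assert.
#if defined(BENCH_USE_ASSUME) && defined(__clang__)
  #define BENCH_ASSERT(cond) __builtin_assume(cond)
#elif defined(BENCH_USE_ASSUME) && defined(__GNUC__)
  #define BENCH_ASSERT(cond) \
      do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
  #define BENCH_ASSERT(cond) assert(cond)
#endif

// Hypothetical kernel: sums n floats. The preconditions state that the
// pointer is valid and that n is a multiple of 8, which the optimizer
// may use (for example, to skip a scalar remainder loop) in assume mode.
inline float sum8(const float* data, std::size_t n) {
    BENCH_ASSERT(data != nullptr);
    BENCH_ASSERT(n % 8 == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Note that in assume mode a violated predicate is undefined behavior, so the predicates must be free of side effects and must actually hold for every call.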
Largest improvements (assume is faster):
- bench_gemm float: nearly all cases 1-9% faster, up to -8.53% at 192x192
- TRMV_float_Lower/512: -8.46%
- VectorNorm_double/65536: -14.01%
- VectorMinCoeff_float/4096: -7.59%, /262144: -7.43%
Largest regressions (assume is slower):
- BlockRead_float/512/64: +78.29% - extreme outlier, possible cache effect
- TRSV_float_Lower/512: +18.78%, /128: +17.61% - consistent, likely real
- TRMV_float_UnitUpper/1024: +18.89%, TRMV_float_Upper/1024: +18.51%
- TRMV_double_UnitUpper/2048: +15.63%
- Dot_cfloat/1048576: +15.76%, Dot_double/65536: +15.78%
- VectorLpNormInf_double/1048576: +20.62% - largest regression apart from the BlockRead outlier
The assume hints help float GEMM consistently but significantly hurt the triangular routines (TRSV Lower, TRMV Upper) and some large-vector reductions. The __builtin_assume hints appear to misguide the compiler's vectorization decisions for certain triangular traversal patterns and large reductions.
In practice, Eigen contains cases where clang takes an optimization hint and produces slower code. Arguably this is a bug in clang's optimizer. It means it is advisable to keep an eye on your generated assembly and timing numbers as you add assume hints. I have seen reports from embedded systems programmers saying that __builtin_unreachable() is extremely important to them for reducing code size, and its widespread use also shows its importance. However, it is hard to encourage wholesale global use today based on these numbers. Attached is the full benchmark data.
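As a rough illustration of checking timing numbers per hint, the sketch below times two copies of a kernel that differ only in one assumption. Everything here is hypothetical (norm_sq_plain, norm_sq_assume, seconds_per_call, HINT) and it is not the harness that produced the numbers above, which came from Eigen's own benchmark. It assumes GCC or clang, since it uses the builtins and an inline asm barrier.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical hint macro: clang builtin, with a GCC fallback.
#if defined(__clang__)
  #define HINT(cond) __builtin_assume(cond)
#else
  #define HINT(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

// Baseline kernel: squared Euclidean norm.
inline double norm_sq_plain(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

// Same kernel, plus one assumption about the length.
inline double norm_sq_assume(const double* x, std::size_t n) {
    HINT(n % 8 == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

// Average wall-clock seconds per call over reps repetitions.
template <class F>
double seconds_per_call(F&& kernel, const std::vector<double>& v, int reps) {
    assert(reps > 0);
    volatile double sink = 0.0;  // keeps the result observable
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        // Compiler barrier (GCC/clang extension) so the call is not
        // hoisted out of the loop as loop-invariant.
        asm volatile("" : : "r"(v.data()) : "memory");
        sink = kernel(v.data(), v.size());
    }
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}
```

Comparing seconds_per_call(norm_sq_plain, v, reps) against seconds_per_call(norm_sq_assume, v, reps) at several sizes, and compiling with -S to diff the generated assembly, is one way to catch a hint that makes things worse before it spreads through a code base.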
Again, if you have a suggestion for a benchmark that is open source, is not already optimized with assume semantics and would be particularly credible as an example of whole-program optimization, let me know.
All the best,
Adrian Johnston