Date: Fri, 3 Apr 2026 03:12:44 +0500
Hi, again!
> Interesting, the compiler even removes all wrappers and only emits
calls to `std::cout`.
In my benchmark, I used `std::cin` and a `volatile` x to make sure that
no compile-time indexing happens!
I completely agree with your point on cache locality; in fact, that's
the whole point of the construct: to give the compiler as much context
as possible so it can make the code cache-local. I know benchmarks can
lie, but assembly lies more, because compilers can produce nearly
identical assembly for two functions whose small differences still have
huge implications for cache performance. I completely agree with you
that cache performance matters more than anything, and my example shows
that the compiler has the context and the intent: I want a contiguously
allocated structure that I can index through. Function calls are worse
because they convey neither enough intent nor enough context, whereas my
proposal gives the compiler context from many sources and the clearest
possible intent.
These intent-based constructs are useful precisely so that the compiler
can do exactly what you said: "deal with cache better". Benchmarks show
that even in small code, compilers fall short when it comes to
optimizing switch statements and `std::visit`. Benchmarks can lie, but
in my case they show that if we have proper syntactic constructs for
these features, we can in fact make them cache-friendly, because the
compiler has an exact map of intent and context and can use it to make
far better optimization decisions.
Thanks again for your feedback.
Regards, Muneem
On Fri, Apr 3, 2026 at 2:57 AM Marcin Jaczewski <marcinjaczewski86_at_[hidden]>
wrote:
> Benchmarks do lie...
>
> The first rule of benchmarking is that you are not testing real code,
> and this creates biases that skew the results; consider something as
> simple as the CPU cache. A benchmark can easily have the whole cache
> to itself and run at 100% speed, but in real code the same logic may
> have only 0.1% of the cache available, because the rest of the code
> needs that cache too. If you get a cache miss, the whole code runs
> 100 times slower.
>
> Only testing the final optimized assembly is a true test of
> performance; benchmarks are only hints for searching for optimal
> solutions.
>
>
> And for your example, why not simply get a function pointer like
> `&A::get` and then call this function directly?
> Why bother with indexes?
>
> ```
> #include <iostream>
> #include <tuple>
> #include <functional>
>
> struct A { int get() { return 10; } };
> struct B { double get() { return 1; } };
> struct C { float get() { return 2.1; } };
>
> template<typename C>
> struct GetCallerArg;
>
> template<typename T, typename R>
> struct GetCallerArg<R (T::*)()>
> {
>     using type = T;
> };
>
> template<typename T, auto Callback>
> void warper(T& a)
> {
>     std::cout << std::invoke(
>         Callback,
>         std::get<typename GetCallerArg<decltype(Callback)>::type>(a)
>     ) << '\n';
> }
>
> template<typename T>
> using Access = void(*)(T&);
>
> int main()
> {
>     std::tuple<A,B,C> t{ A{}, B{}, C{} };
>
>     Access<decltype(t)> f = &warper<decltype(t), &A::get>;
>     f(t);
>
>     f = &warper<decltype(t), &B::get>;
>     f(t);
> }
> ```
> https://godbolt.org/z/Y5fGEMdc4
>
> Interestingly, the compiler even removes all wrappers and only emits
> calls to `std::cout`.
>
> On Thu, 2 Apr 2026 at 23:20, Muneem via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
> >
> > Benchmarks don't lie! Even if the assembly is the same size and
> > looks similar, benchmarks show otherwise. Do the benchmarking on
> > the code that I showed using my proposal.
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden]
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2026-04-02 22:13:01
