Date: Sat, 4 Apr 2026 00:48:18 +0500
Hi!
Thanks again for your feedback, Macieira. 👍
> The microbenchmark is misleading.
1. The reason I gave you microbenchmarks is that some people asked for them, and even I was reluctant to use them, despite Bjarne Stroustrup's advice "Don't assume, measure", because in this case the goal is either to reduce compile-time overhead or to make the runtime faster, both of which my new proposal targets.
2. You are right that the compiler might have folded the loop in half, but it still shows that the observable behaviour is the same. In fact, if the loop body were to index into a heterogeneous set (using the proposed construct) and do some operation, the compiler could optimize the indexing when the index comes from a single source. This shows that expressing intent can help the compiler do wonders:
  1. Fold loops even when I used volatile to avoid it.
  2. Avoid the indexing operations entirely (if in a loop, with minimal compile-time overhead).
  3. Store the result in some memory location immediately after it takes input (if that solution is the fastest).
3. Optimize a single expression for the sake of the whole program. Currently the optimizer might in fact be able to optimize checks in a loop, but it's neither easy nor guaranteed, because there are no semantic promises we can make with the existing constructs to make it happen.
4. My main point isn't whether my benchmark is right or wrong, but rather that expressing intent is better. The benchmark was merely to show that std::visit is slower (according to programs compiled with g++ and Microsoft Visual Studio 2026, measured with std::chrono and the Visual Studio 2026 CPU-usage tools). But even if some or all compilers optimize its performance, we still pay compile-time overhead for taking std::visit and making it faster, and the optimization might backfire, since it would optimize single statements independently of the rest of the program. Why? Because unlike my proposed construct, std::visit does not carry enough context and intent to tell the compiler what's going on, so that it can generate exactly the "book-keeping" data and access code that fit the entire program.
5. In case someone thinks a few nanoseconds in a single example aren't a big deal: if my construct is adopted, then indeed they wouldn't be, because the compiler could optimize many indexing operations into a single heterogeneous set and maybe cache the result somewhere afterwards. The issue is that this can't be done with the current techniques, because of the lack of intent. Compilers are much smarter than any of us could ever be, because they are the work of many people's entire careers, not just one very smart guy from Intel, so we shouldn't blame or restrict compilers, whose job is to optimize as generally as possible for the sake of the whole program.
6. > I suppose it decided to unroll the loop a bit
> and made two calls to sink() per loop:
> template <typename T> void sink(const T &) { asm volatile("" ::: "memory"); }
Even if it optimized the switch statement despite the asm volatile("" ::: "memory") barrier, but not std::visit, that is exactly my point: it isn't that switch is magically faster, but rather that the compiler has more room to cheat and skip things. In fact, the standard allows it a lot of freedom as long as the observable behaviour is the same, and even more freedom where it permits sets of observable behaviours (unspecified behaviour).
7. The microbenchmarking wasn't meant to show that std::visit is inherently slower, but rather that the compiler can, and reasonably will, fall short when optimizing it, in order to avoid massive compile-time overhead.
On Fri, 3 Apr 2026, 8:33 pm Thiago Macieira via Std-Proposals, <
std-proposals_at_[hidden]> wrote:
> On Thursday, 2 April 2026 19:15:42 Pacific Daylight Time Thiago Macieira
> via
> Std-Proposals wrote:
> > Even in this case, I have profiled the code above (after fixing it and
> > removing the std::cout itself) and found that overall, the switched case
> > ran 2x faster, at 0.113 ns per iteration, while the variant case required
> > 0.227 ns per iteration. Looking at the CPU performance counters, the
> > std::variant code has 2 branches per iteration and takes 1 cycle per
> > iteration, running at 5 IPC (thus, 5 instructions per iteration).
> > Meanwhile, the switched case has 0.5 branch per iteration and takes 0.5
> > cycle per iteration, running at 2 IPC. The half cycle numbers make sense
> > because I believe the two instructions are getting macrofused together
> and
> > execute as a single uop, which causes confusing numbers.
>
> This half a cycle and ninth of a nanosecond problem has been on my mind
> for a
> while. The execution time of anything needs to be a multiple of the cycle
> time, so a CPU running at 4.5 GHz like mine was shouldn't have a
> difference of
> one ninth of a nanosecond. One explanation would be that somehow the CPU
> was
> executing two iterations of the loop at the same time, pipelining.
>
> But disassembling the binary shows a simpler explanation. The switch loop
> was:
>
> 40149f: mov $0x3b9aca00,%eax
> 4014a4: nop
> 4014a5: data16 cs nopw 0x0(%rax,%rax,1)
> 4014b0: sub $0x2,%eax
> 4014b3: jne 4014b0
>
> [Note how there is no test for what was being indexed in the loop!]
>
> Here's what I had missed: sub $2. I'm not entirely certain what GCC was
> thinking here, but it's subtracting 2 instead of 1, so this looped half a
> billion times (0x3b9aca00 / 2). I suppose it decided to unroll the loop a
> bit
> and made two calls to sink() per loop:
>
> template <typename T> void sink(const T &) { asm volatile("" :::
> "memory"); }
>
> But that expanded to nothing in the output. I could add "nop" so we'd see
> what
> happened and the CPU would be obligated to retire those instructions,
> increasing the instruction executed counter (I can't quickly find how many
> the
> TGL processor / WLC core can retire per cycle, but I recall it's 6, so
> adding
> 2 more instructions shouldn't affect the execution time). But I don't
> think I
> need to further benchmark this to prove my point:
>
> The microbenchmark is misleading.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Data Center - Platform & Sys. Eng.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2026-04-03 19:48:33
