std-proposals

Re: [std-proposals] Fwd: Extension to runtime polymorphism proposed

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 03 Apr 2026 08:33:15 -0700
On Thursday, 2 April 2026 19:15:42 Pacific Daylight Time Thiago Macieira via
Std-Proposals wrote:
> Even in this case, I have profiled the code above (after fixing it and
> removing the std::cout itself) and found that overall, the switched case
> ran 2x faster, at 0.113 ns per iteration, while the variant case required
> 0.227 ns per iteration. Looking at the CPU performance counters, the
> std::variant code has 2 branches per iteration and takes 1 cycle per
> iteration, running at 5 IPC (thus, 5 instructions per iteration).
> Meanwhile, the switched case has 0.5 branch per iteration and takes 0.5
> cycle per iteration, running at 2 IPC. The half cycle numbers make sense
> because I believe the two instructions are getting macrofused together and
> execute as a single uop, which causes confusing numbers.

This half-a-cycle, ninth-of-a-nanosecond problem has been on my mind for a
while. The execution time of anything needs to be a multiple of the cycle
time, so a CPU running at 4.5 GHz, like mine was, shouldn't show a difference
of one ninth of a nanosecond, which is only half a cycle. One explanation
would be that the CPU was somehow executing two iterations of the loop at the
same time through pipelining.
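
To make the arithmetic concrete, here is a small sketch that converts the
measured times into cycles. It assumes the core really held a constant
4.5 GHz for the whole run; the names kClockGHz, kCycleNs and
cycles_per_iteration are made up for this illustration.

```cpp
// Assumption: constant 4.5 GHz clock, as quoted in the message.
constexpr double kClockGHz = 4.5;
// One cycle lasts 1 / 4.5 ns, i.e. two ninths of a nanosecond (~0.222 ns).
constexpr double kCycleNs = 1.0 / kClockGHz;

// Convert a measured time per iteration (in ns) into core cycles.
constexpr double cycles_per_iteration(double ns_per_iteration) {
    return ns_per_iteration / kCycleNs;
}
```

With these assumptions, cycles_per_iteration(0.113) comes out just above 0.5
and cycles_per_iteration(0.227) just above 1.0, which is why the switch
numbers look like half a cycle per iteration.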

But disassembling the binary shows a simpler explanation. The switch loop was:

  40149f: mov $0x3b9aca00,%eax
  4014a4: nop
  4014a5: data16 cs nopw 0x0(%rax,%rax,1)
  4014b0: sub $0x2,%eax
  4014b3: jne 4014b0

[Note how there is no test for what was being indexed in the loop!]

Here's what I had missed: sub $2. I'm not entirely certain what GCC was
thinking here, but it's subtracting 2 instead of 1, so this looped half a
billion times (0x3b9aca00 / 2). I suppose it decided to unroll the loop a bit
and made two calls to sink() per loop:

template <typename T> void sink(const T &) { asm volatile("" ::: "memory"); }

But sink() expanded to nothing in the output. I could add a "nop" so we'd see
what happened and the CPU would be obligated to retire those instructions,
increasing the instructions-executed counter. (I can't quickly find how many
instructions the TGL processor / WLC core can retire per cycle, but I recall
it's 6, so adding 2 more shouldn't affect the execution time.) But I don't
think I need to benchmark this further to prove my point:

The microbenchmark is misleading.
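
For anyone who wants to see the unrolling for themselves, below is a rough
hand-written model of what the disassembly suggests: the counter drops by 2
per trip and each trip makes two sink calls. The original benchmark source is
not in this message, so unrolled_trips and sink_nop are hypothetical names,
and the "nop" variant is only the experiment suggested above, not something
that was measured. It relies on GCC/Clang extended asm.

```cpp
#include <cassert>
#include <cstdint>

// sink() as given in the message: a compiler barrier that emits no code.
template <typename T> void sink(const T &) { asm volatile("" ::: "memory"); }

// Hypothetical variant: emits one real nop per call, so each call shows up
// in the disassembly and in the retired-instruction counter.
template <typename T> void sink_nop(const T &) { asm volatile("nop" ::: "memory"); }

// The constant loaded into %eax in the disassembly: 1,000,000,000.
constexpr std::uint32_t kIterations = 0x3b9aca00;

// Model of the unrolled loop: two sink calls per trip, counter decremented
// by 2 (mirroring `sub $0x2,%eax; jne`). Returns the number of trips taken.
// Precondition: `iterations` is even, otherwise the counter never hits zero.
inline std::uint32_t unrolled_trips(std::uint32_t iterations) {
    assert(iterations % 2 == 0);
    std::uint32_t trips = 0;
    for (std::uint32_t remaining = iterations; remaining != 0; remaining -= 2) {
        sink_nop(remaining);  // first source iteration of this trip
        sink_nop(remaining);  // second source iteration of this trip
        ++trips;
    }
    return trips;
}
```

Calling unrolled_trips(kIterations) takes half a billion trips for a billion
source iterations. If that reading of the disassembly is right, the measured
0.113 ns per iteration is about 0.226 ns per trip, or roughly one whole cycle
at 4.5 GHz, so no half-cycle execution is needed to explain the number.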

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Data Center - Platform & Sys. Eng.

Received on 2026-04-03 15:33:29