ISOCPP std-proposals List: Re: [std-proposals] Fwd: Extension to runtime polymorphism proposed

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 02 Apr 2026 19:15:42 -0700

On Thursday, 2 April 2026 16:53:30 Pacific Daylight Time Muneem via
Std-Proposals wrote:
> struct data_accessed_through_visit {
> static std::variant<A, B, C> obj;
>
> inline void operator()(int) {
> std::visit([](auto&& arg) {
> std::cout<< arg.get();
> }, obj);
> }
> };
> //proves that std visit is too verbose and slow (too hard/impractical to
> optimize) for this task
> what about switches, enjoy the look for yourself(while imagining the look
> if this were repeated multiple times over hundreds of case statements):
> struct data_switched {
> inline void operator()(int index) {
> switch(index) {
> case 0: std::cout<<1;
> case 1: std::cout<<"love C++";
> case 2: std::cout<<2;
> default: return -1;
> }
> }
> };

This is a very misleading benchmark because it's not comparing apples to
apples. It also doesn't compile (the switch returns -1 from a void function)
and is missing breaks. Please paste your code from Godbolt, where you can
check that it compiles, instead of typing it here.

In the std::variant case above, the quantity being tested is stored in a
static member variable (effectively a global). As you're looping over the call
operator function, the compiler is constantly reloading the index of the
active alternative and making a runtime decision on it. It can't cache this
value because, as a global, it could conceivably be modified by the std::cout
call.
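
As a rough illustration of that difference, here is a minimal sketch. The
alternatives A, B and C, the sink() function standing in for std::cout, and
both loop functions are made up for this example; they are not taken from the
quoted code, and whether a given compiler actually hoists the dispatch in the
second loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <variant>

// Hypothetical alternatives; the quoted code never shows A, B, or C.
struct A { int get() const { return 1; } };
struct B { int get() const { return 2; } };
struct C { int get() const { return 3; } };

// Global variant, playing the role of the static member in the quoted code.
std::variant<A, B, C> obj;

// Stand-in for std::cout. In a real test this would live in another
// translation unit so the compiler cannot see its body.
long long total = 0;
void sink(int value) { total += value; }

// The active alternative is looked up in a global, so the compiler has to
// assume any opaque call inside the loop may have changed it.
void loop_over_global(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        std::visit([](const auto& arg) { sink(arg.get()); }, obj);
}

// The variant is copied into a local whose address is never handed to
// sink(), so the compiler is free to treat the alternative as loop-invariant.
void loop_over_local(std::size_t n) {
    assert(!obj.valueless_by_exception());
    const std::variant<A, B, C> local = obj;
    for (std::size_t i = 0; i < n; ++i)
        std::visit([](const auto& arg) { sink(arg.get()); }, local);
}
```

Both functions compute the same thing; only the second gives the optimizer
permission to decide on the alternative once, outside the loop.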

In the switched case, the quantity being tested is local to the function and
never escapes. It's provided by the caller at the time of the call. So the
compiler knows it can't change while the loop runs and inverts the looping:
instead of 1x loop over 3 possibilities of std::cout calls, it has 3x loops of
1 possibility each.
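
Written out by hand, that inversion (usually called loop unswitching) looks
roughly like the sketch below. The handle() function and the values it
receives are placeholders invented for this example, standing in for the
std::cout calls; the second function only approximates what an optimizer
might produce.

```cpp
#include <cassert>
#include <cstddef>

// Placeholder for the std::cout calls in the quoted switch.
long long counter = 0;
void handle(int value) { counter += value; }

// As written: one loop, with a three-way decision on every iteration.
void loop_as_written(int index, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        switch (index) {
        case 0: handle(1); break;
        case 1: handle(10); break;
        case 2: handle(100); break;
        default: break;
        }
    }
}

// After unswitching: the decision is made once, and each case gets its own
// loop with a single possibility inside.
void loop_unswitched(int index, std::size_t n) {
    switch (index) {
    case 0: for (std::size_t i = 0; i < n; ++i) handle(1); break;
    case 1: for (std::size_t i = 0; i < n; ++i) handle(10); break;
    case 2: for (std::size_t i = 0; i < n; ++i) handle(100); break;
    default: break;
    }
}

// Both forms must produce the same result for every index.
void check_equivalence(std::size_t n) {
    for (int index = -1; index <= 3; ++index) {
        counter = 0;
        loop_as_written(index, n);
        const long long expected = counter;
        counter = 0;
        loop_unswitched(index, n);
        assert(counter == expected);
    }
}
```

The transformation is only legal because index is a by-value parameter that
nothing inside the loop can modify.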

Now, this may be the very thing you're trying to tell us needs to be
improved. If so, you have failed to communicate that. More importantly, you
have failed to describe what the std::variant is being used for here. This is
not a typical use.

Even in this case, I profiled the code above (after fixing it and removing
the std::cout itself) and found that the switched case ran 2x faster overall,
at 0.113 ns per iteration, while the variant case required 0.227 ns per
iteration. Looking at the CPU performance counters, the std::variant code has
2 branches per iteration and takes 1 cycle per iteration, running at 5 IPC
(thus, 5 instructions per iteration). Meanwhile, the switched case has 0.5
branches per iteration and takes 0.5 cycles per iteration, running at 2 IPC.
The half-cycle numbers make sense because I believe the two instructions are
getting macro-fused and executing as a single uop, which makes the numbers
confusing.
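
For anyone wanting to cross-check those counters, the arithmetic can be
sketched as below. The function and constant names are invented for this
example; the inputs are only the figures quoted above, and the implied clock
frequency is derived from them, not something that was measured.

```cpp
#include <cassert>

// Instructions retired per loop iteration = IPC * cycles per iteration.
constexpr double instructions_per_iteration(double ipc, double cycles) {
    return ipc * cycles;
}

// Implied core clock in GHz = cycles per iteration / ns per iteration.
constexpr double implied_ghz(double cycles, double ns) {
    return cycles / ns;
}

// Figures quoted above for the two loops.
constexpr double variant_ns = 0.227, variant_cycles = 1.0, variant_ipc = 5.0;
constexpr double switch_ns  = 0.113, switch_cycles  = 0.5, switch_ipc  = 2.0;

// Extra time per iteration paid by the variant loop, in nanoseconds.
constexpr double extra_ns = variant_ns - switch_ns;

void check_counters() {
    // 5 IPC over 1 cycle, and 2 IPC over half a cycle.
    assert(instructions_per_iteration(variant_ipc, variant_cycles) == 5.0);
    assert(instructions_per_iteration(switch_ipc, switch_cycles) == 1.0);
    // Both runs imply about the same clock, somewhere near 4.4 GHz, which
    // suggests the two sets of numbers are mutually consistent.
    assert(implied_ghz(variant_cycles, variant_ns) > 4.3);
    assert(implied_ghz(variant_cycles, variant_ns) < 4.5);
    assert(implied_ghz(switch_cycles, switch_ns) > 4.3);
    assert(implied_ghz(switch_cycles, switch_ns) < 4.5);
}
```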

In other words, this microbenchmark is pointless. It showed that the extra
cost is about one ninth of a nanosecond per iteration on my system. The number
of mispredicted branches was indistinguishable from zero in both, which is
unrealistic for real-world code.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Data Center - Platform & Sys. Eng.

Received on 2026-04-03 02:15:51