Date: Sat, 4 Apr 2026 00:00:34 +0500
Hi!
Thank you, everyone, for your feedback ❤️❤️❤️❤️❤️.
1. I have discussed before that std::visit is inefficient/hard to
optimize and doesn't convey intent and context.
2. This might not fit into the current type system, but the alternative is
to write ugly switch statements again and again, or to rely on some form of
polymorphism or std::visit, as you put it.
3. This can fit into the current thinking or model of the type system if we
add enough semantic rules to make any unsafe usage impossible. That is why
I, like anyone else, would want to discuss this with experts like you.
4. It does fit into the static type system, in the sense that it is meant
to capture values by const lvalue reference and to produce code (just like
templates do) for each value being indexed, or to emit jumps if the
compiler thinks that jumps are a better technique for that specific use case.
5. std::visit is like a plain if statement, while my technique tries to add
an extra constexpr if statement, which makes it easier (or guaranteed) for the
compiler to optimize the branch.
Regards, Muneem
On Fri, 3 Apr 2026, 8:33 pm Thiago Macieira via Std-Proposals, <
std-proposals_at_[hidden]> wrote:
> On Thursday, 2 April 2026 19:15:42 Pacific Daylight Time Thiago Macieira
> via
> Std-Proposals wrote:
> > Even in this case, I have profiled the code above (after fixing it and
> > removing the std::cout itself) and found that overall, the switched case
> > ran 2x faster, at 0.113 ns per iteration, while the variant case required
> > 0.227 ns per iteration. Looking at the CPU performance counters, the
> > std::variant code has 2 branches per iteration and takes 1 cycle per
> > iteration, running at 5 IPC (thus, 5 instructions per iteration).
> > Meanwhile, the switched case has 0.5 branch per iteration and takes 0.5
> > cycle per iteration, running at 2 IPC. The half cycle numbers make sense
> > because I believe the two instructions are getting macrofused together
> and
> > execute as a single uop, which causes confusing numbers.
>
> This half a cycle and ninth of a nanosecond problem has been on my mind
> for a
> while. The execution time of anything needs to be a multiple of the cycle
> time, so a CPU running at 4.5 GHz like mine was shouldn't have a
> difference of
> one ninth of a nanosecond. One explanation would be that somehow the CPU
> was
> executing two iterations of the loop at the same time, pipelining.
>
> But disassembling the binary shows a simpler explanation. The switch loop
> was:
>
> 40149f: mov $0x3b9aca00,%eax
> 4014a4: nop
> 4014a5: data16 cs nopw 0x0(%rax,%rax,1)
> 4014b0: sub $0x2,%eax
> 4014b3: jne 4014b0
>
> [Note how there is no test for what was being indexed in the loop!]
>
> Here's what I had missed: sub $2. I'm not entirely certain what GCC was
> thinking here, but it's subtracting 2 instead of 1, so this looped half a
> billion times (0x3b9aca00 / 2). I suppose it decided to unroll the loop a
> bit
> and made two calls to sink() per loop:
>
> template <typename T> void sink(const T &) { asm volatile("" :::
> "memory"); }
>
> But that expanded to nothing in the output. I could add "nop" so we'd see
> what
> happened and the CPU would be obligated to retire those instructions,
> increasing the instruction executed counter (I can't quickly find how many
> the
> TGL processor / WLC core can retire per cycle, but I recall it's 6, so
> adding
> 2 more instructions shouldn't affect the execution time). But I don't
> think I
> need to further benchmark this to prove my point:
>
> The microbenchmark is misleading.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Data Center - Platform & Sys. Eng.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2026-04-03 19:00:50
