Date: Wed, 30 Apr 2025 18:08:18 -0700
On Wednesday, 30 April 2025 15:58:35 Pacific Daylight Time Tiago Freire via
Std-Proposals wrote:
> Here’s a benchmark.
> Here are the results, using an AMD 7950X:
>
> MSVC on Windows:
> func<div_ub> 324 ns 322 ns 2133333
> func<div_check> 324 ns 328 ns 2240000
> func<div_throw> 336 ns 338 ns 2036364
> GCC on WSL:
> func<div_ub> 324 ns 322 ns 2133333
> func<div_check> 324 ns 328 ns 2240000
> func<div_throw> 336 ns 338 ns 2036364
>
> In other words, it makes almost no difference whatsoever; the exception one
> is slightly worse (but still not significant).
Your div_throw function compiles to the same code as div_ub, because nothing
in its try block can throw: dividing by zero is undefined behaviour, not a C++
exception, so there is nothing for the handler to catch.
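For illustration, here is a minimal sketch of the three variants under
discussion; the exact bodies and the fallback value are my assumptions, not
your code. The point is that a try block wrapped around a plain division
cannot throw, so a div_throw written that way folds into div_ub; the check has
to be explicit:

#include <stdexcept>

long long div_ub(long long a, long long b)
{
    return a / b;                  // undefined behaviour if b == 0
}

long long div_check(long long a, long long b)
{
    return b == 0 ? 0 : a / b;     // assumed fallback value of 0
}

long long div_throw(long long a, long long b)
{
    // Division by zero is UB, not an exception, so try/catch around
    // "a / b" alone would be dead code; the test must be written out.
    if (b == 0)
        throw std::domain_error("division by zero");
    return a / b;
}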
I modified your test so we can get low-level performance counter data using
perf stat -r10, where each run is 1024*1024 loops of the divisions in your
func() function.
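For concreteness, the driver was shaped roughly like this. This is a
reconstruction rather than the actual modified test: the divisor values, the
16x16 kernel and the argv selection are my assumptions, chosen to match the
division counts discussed below.

#include <cstring>

long long div_ub(long long a, long long b)    { return a / b; }
long long div_check(long long a, long long b) { return b == 0 ? 0 : a / b; }

// Divisors are globals so the compiler can't prove they are never zero.
long long divisors[16] = { 3, 4, 5, 6, 7, 8, 9, 10,
                           11, 12, 13, 14, 15, 16, 17, 18 };

template <long long (*Div)(long long, long long)>
__attribute__((noinline)) long long func()
{
    long long sum = 0;
    for (int i = 0; i != 16; ++i)           // 16 loops...
        for (int j = 0; j != 16; ++j)       // ...of 16 divisions each
            sum += Div(1'000'000'007LL, divisors[j]);
    return sum;
}

int main(int argc, char **argv)
{
    bool check = argc > 1 && std::strcmp(argv[1], "check") == 0;
    long long sink = 0;
    for (int i = 0; i != 1024 * 1024; ++i)  // 268,435,456 divisions in total
        sink += check ? func<div_check>() : func<div_ub>();
    return sink == 0;                       // keep the result observable
}

Running that under perf stat -r10 ./a.out ub and perf stat -r10 ./a.out check
gives: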
Performance counter stats for './a.out ub' (10 runs):
1,564.93 msec task-clock:u
0 context-switches:u
0 cpu-migrations:u
128 page-faults:u
6,850,020,684 cycles:u
1,432,032,452 instructions:u
18,148,262 branches:u
16,740 branch-misses:u # 0.09% of all branches
Performance counter stats for './a.out check' (10 runs):
1,618.02 msec task-clock:u
0 context-switches:u
0 cpu-migrations:u
129 page-faults:u
7,084,607,943 cycles:u
2,037,060,870 instructions:u
270,855,134 branches:u
16,760 branch-misses:u
The UB code runs at 0.21 instructions per cycle (1.43 billion instructions
over 6.85 billion cycles), while the checking one manages 0.29; both are
horrible numbers for anyone doing optimisations.
That's completely expected, considering we're talking about the single worst
non-memory instruction on x86 processors: 64-bit integer division. This
particular instruction pipelines poorly, which explains the reciprocal of that
measurement: each instruction takes 4 to 5 cycles on average in this
application. That means the processor has plenty of spare cycles to run
speculatively ahead of the division and check that the next dozens of input
numbers aren't zero. I can't say, though, whether this is a good or a bad
micro-benchmark. It's not testing one of the OP's hypotheses: that this extra
check would pollute the branch prediction unit (BPU) used by everything else
in the application.
What it does show is that, in spite of adding 252.7 million branches and just
over 600 million instructions, the execution time only increased by 53
milliseconds over the 1024*1024-iteration run. Each iteration performs 16
loops of 16 divisions, for 268,435,456 divisions in total, so the marginal
cost of the added check was 198 picoseconds per division. The change in the
branch-miss counter is negligible (16,740 vs 16,760).
Another way to view this: the check added 234M cycles across 268M divisions,
an average of 0.874 extra cycles per division.
I did check that div_check does emit a branch before each division and div_ub
does not. div_check:
1593: test %r15,%r15              # divisor zero check
1596: je 1731 <void func<&(div_check(long long, long long))>()+0x281>
159c: mov %rcx,%rax
159f: cqto                        # sign-extend the dividend into %rdx:%rax
15a1: idiv %r15                   # 64-bit signed division
15a4: add %rax,%rdx
15a7: add %rdx,%rsi
15aa: test %r14,%r14              # next divisor's zero check
15ad: je 1731 <void func<&(div_check(long long, long long))>()+0x281>
15b3: mov %rcx,%rax
15b6: cqto
15b8: idiv %r14
15bb: add %rax,%rdx
15be: add %rdx,%rsi
and div_ub, where no test/je precedes the idiv:
13f4: cqto
13f6: idiv %r14
13f9: add %rdx,%rax
13fc: add %rax,%rsi
13ff: mov %rcx,%rax
1402: cqto
1404: idiv %r13
1407: add %rdx,%rax
140a: add %rax,%rsi
140d: mov %rcx,%rax
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel DCAI Platform & System Engineering