C++ Logo

std-proposals

Advanced search

Re: [std-proposals] 回复: 回复: Proposal: Integer division/remainder/division-remainder functions with exception throwing behavior

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 30 Apr 2025 18:08:18 -0700
On Wednesday, 30 April 2025 15:58:35 Pacific Daylight Time Tiago Freire via
Std-Proposals wrote:
> Here’s a benchmark.
> Here are the results: using an AMD 7950X
>
> MSVC on Windows:
> func<div_ub> 324 ns 322 ns 2133333
> func<div_check> 324 ns 328 ns 2240000
> func<div_throw> 336 ns 338 ns 2036364
> GCC on WSL:
> func<div_ub> 324 ns 322 ns 2133333
> func<div_check> 324 ns 328 ns 2240000
> func<div_throw> 336 ns 338 ns 2036364
>
> In other words. It makes almost no difference whatsoever, the exception one
> is slightly worse (but still not significant).

Your div_throw function is identical to div_ub because the try block can't
throw exceptions.

Modifying your test so we can get low-level performance counter data using
perf stat -r10, where each run is 1024*1024 loops of the divisions in your
func() function:

 Performance counter stats for './a.out ub' (10 runs):

          1,564.93 msec task-clock:u
                 0 context-switches:u
                 0 cpu-migrations:u
               128 page-faults:u
     6,850,020,684 cycles:u
     1,432,032,452 instructions:u
        18,148,262 branches:u
            16,740 branch-misses:u # 0.09% of all branches

 Performance counter stats for './a.out check' (10 runs):

          1,618.02 msec task-clock:u
                 0 context-switches:u
                 0 cpu-migrations:u
               129 page-faults:u
     7,084,607,943 cycles:u
     2,037,060,870 instructions:u
       270,855,134 branches:u
            16,760 branch-misses:u

The UB code runs at 0.21 instructions per cycle, while the checking one
manages 0.29, which is a horrible number for anyone doing optimisations.
That's completely expected considering we're talking about the single worst
non-memory instruction in the x86 processor: 64-bit integer division. This
particular instruction has a problem that it pipelines poorly, which explains
the reciprocal measurement: it's taking 4 to 5 cycles to run each instruction
in this application. That means the processor has a lot of spare cycles to run
speculatively ahead of the division and check that the next dozens of input
numbers aren't zero. I can't say though whether this is a good or bad micro-
benchmark. It's not testing a hypothesis of the OP, which is that this extra
check would pollute the BPU used by everything else in the application.

What it does show is that, in spite of adding 252.7k branches and nearly 600
million instructions, the execution time only increased by 53 milliseconds, in
a 1M loop run. Each loop has 16 loops of 16 divisions, so the marginal cost of
the added check was 198 picoseconds per division. The branch-miss counter
change is negligible.

Another way to view this is that it added 234M cycles in 268M divisions, so an
average of 0.874 extra cycles.

I did check that div_check and div_ub did emit and not emit a branch,
respectively:
    1593: test %r15,%r15
    1596: je 1731 <void func<&(div_check(long long, long long))>()
+0x281>
    159c: mov %rcx,%rax
    159f: cqto
    15a1: idiv %r15
    15a4: add %rax,%rdx
    15a7: add %rdx,%rsi
    15aa: test %r14,%r14
    15ad: je 1731 <void func<&(div_check(long long, long long))>()
+0x281>
    15b3: mov %rcx,%rax
    15b6: cqto
    15b8: idiv %r14
    15bb: add %rax,%rdx
    15be: add %rdx,%rsi

and
    13f4: cqto
    13f6: idiv %r14
    13f9: add %rdx,%rax
    13fc: add %rax,%rsi
    13ff: mov %rcx,%rax
    1402: cqto
    1404: idiv %r13
    1407: add %rdx,%rax
    140a: add %rax,%rsi
    140d: mov %rcx,%rax

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel DCAI Platform & System Engineering

Received on 2025-05-01 01:08:21