std-proposals

Re: [std-proposals] Multiprecision division

From: Hans Åberg <haberg_1_at_[hidden]>
Date: Fri, 8 Aug 2025 09:17:16 +0200
> On 7 Aug 2025, at 22:04, Thiago Macieira via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> On Thursday, 7 August 2025 06:43:30 Pacific Daylight Time Hans Åberg via Std-
> Proposals wrote:
>> Intel Coffee Lake 2.4–4.1 GHz
>> div divq
>> clang O3 28 ns 25 ns
>
> That's a 10-year-old architecture. The latency for this instruction is up to
> 73 cycles on CFL: https://uops.info/html-instr/DIV_R64.html#CFL
> 25 ns is either 73 cycles at 2.92 GHz or 60 cycles at 2.4 GHz or anything in-
> between.
>
> If you're going to micro-benchmark, I suggest measuring something that doesn't
> change with frequency, like the cycle count or the number of instructions
> retired (or both).
>
> Sunny Cove improves that to 18 cycles:
> https://uops.info/html-instr/DIV_R64.html#ICL
> Similar on AMD Zen 3 and 4:
> https://uops.info/html-instr/DIV_R64.html#ZEN4

My ambitions are not that high. :-) I was mostly interested in checking directly against “divq”; the claim is that it is poorly implemented, which is in line with my testing.

On ARM64, only the counter register CNTVCT_EL0 is available to user code for “virtual cycles”. It ticks at 1 GHz, that is, once per nanosecond, which agreed with the timings I made, so one can just as well measure time directly. I did count cycles on Intel as well, but that did not show anything interesting to me.
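A minimal sketch of reading that counter from user code follows. The helper names are hypothetical. It assumes a GCC- or Clang-compatible compiler and an operating system that permits EL0 reads of CNTVCT_EL0 and CNTFRQ_EL0; on other architectures it falls back to std::chrono::steady_clock as a stand-in. The counter runs at a fixed frequency, so it measures elapsed time and not core clock cycles. The frequency is read from CNTFRQ_EL0 rather than hard-coded, since it need not be 1 GHz on every system.

```cpp
#include <chrono>
#include <cstdint>

// Read a monotonic tick counter.
// AArch64: the virtual counter register CNTVCT_EL0 (ISB first, so the read
// is not reordered ahead of earlier instructions).
// Elsewhere: steady_clock in nanoseconds, as a portable stand-in.
inline std::uint64_t read_ticks() {
#if defined(__aarch64__)
    std::uint64_t t;
    asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
#else
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
#endif
}

// Tick frequency in Hz: CNTFRQ_EL0 on AArch64, 1 GHz for the fallback.
inline std::uint64_t tick_frequency_hz() {
#if defined(__aarch64__)
    std::uint64_t f;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ULL;
#endif
}

// Convert a tick difference to nanoseconds for a given counter frequency.
inline double ticks_to_ns(std::uint64_t ticks, std::uint64_t hz) {
    return static_cast<double>(ticks) * 1e9 / static_cast<double>(hz);
}
```

A timing would take `read_ticks()` before and after the code under test and pass the difference, together with `tick_frequency_hz()`, to `ticks_to_ns`. With a 1 GHz counter the tick difference is already the time in nanoseconds.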

Received on 2025-08-08 07:17:35