Date: Thu, 7 Aug 2025 15:43:30 +0200
> On 7 Aug 2025, at 14:48, Sebastian Wittmeier via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> Hi Hans,
> is the difference in speed to LLVM and GCC due to implementation or due to function interface for common standard types?
There is a classical division algorithm, from the 1990s or earlier, that is typically used, and which I improved on through some further mathematical analysis.
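For reference, a minimal sketch of the kind of classical 2-by-1 routine in question (Knuth-style quotient-digit estimation, as published in e.g. Hacker's Delight and used by the compiler builtins); this is the baseline, not my improved version, and the names are just illustrative:

#include <bit>
#include <cstdint>

// Classical 2-by-1 division: divide the 128-bit dividend (u1:u0) by the
// 64-bit divisor v, returning the quotient and optionally the remainder.
// Assumes v != 0 and u1 < v, so that the quotient fits in 64 bits.
std::uint64_t div_2by1_classical(std::uint64_t u1, std::uint64_t u0,
                                 std::uint64_t v, std::uint64_t* r)
{
    const std::uint64_t b = std::uint64_t(1) << 32;  // digit base, 32 bits

    int s = std::countl_zero(v);      // normalize: shift the divisor so
    v <<= s;                          // its most significant bit is set
    std::uint64_t vn1 = v >> 32;      // divisor digits
    std::uint64_t vn0 = v & 0xFFFFFFFFu;

    // Shift the dividend left by the same amount (the branch avoids
    // undefined behaviour when s == 0).
    std::uint64_t un64 = (u1 << s) | (s ? (u0 >> (64 - s)) : 0);
    std::uint64_t un10 = u0 << s;
    std::uint64_t un1 = un10 >> 32;   // low dividend digits
    std::uint64_t un0 = un10 & 0xFFFFFFFFu;

    // Estimate the high quotient digit from the leading digits and
    // correct it; the correction loop runs at most twice.
    std::uint64_t q1 = un64 / vn1;
    std::uint64_t rhat = un64 - q1 * vn1;
    while (q1 >= b || q1 * vn0 > b * rhat + un1) {
        --q1;
        rhat += vn1;
        if (rhat >= b) break;
    }

    std::uint64_t un21 = un64 * b + un1 - q1 * v;  // partial remainder

    // Estimate and correct the low quotient digit the same way.
    std::uint64_t q0 = un21 / vn1;
    rhat = un21 - q0 * vn1;
    while (q0 >= b || q0 * vn0 > b * rhat + un0) {
        --q0;
        rhat += vn1;
        if (rhat >= b) break;
    }

    if (r != nullptr)
        *r = (un21 * b + un0 - q0 * v) >> s;  // undo the normalization
    return q1 * b + q0;
}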
For GCC, I only used uint128_t = unsigned __int128, but this was on macOS, so it could well rely on the C library, which comes from LLVM. For Clang, I tried both uint128_t and the LLVM implementation code directly, and the difference was only about 10%. The LLVM code is a straightforward implementation of the original division algorithm. The GCC code looked more complicated, so I did not investigate it directly.
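To be concrete, the uint128_t variant is essentially just letting the compiler and runtime do the work; a minimal sketch (illustrative names, not the actual benchmark code):

#include <cstdint>

using uint128_t = unsigned __int128;

struct QuotRem { std::uint64_t quot, rem; };

// Divide the 128-bit value (hi:lo) by a 64-bit divisor d and let the
// compiler do the rest; on Clang/GCC this typically lowers to a runtime
// call such as __udivti3. Assumes d != 0 and hi < d, so the quotient
// fits in 64 bits.
inline QuotRem div128(std::uint64_t hi, std::uint64_t lo, std::uint64_t d)
{
    uint128_t n = (static_cast<uint128_t>(hi) << 64) | lo;
    return { static_cast<std::uint64_t>(n / d),
             static_cast<std::uint64_t>(n % d) };
}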
The difference is not that big. Some timings, using random 2×64-bit-by-64-bit inputs (128-bit dividend, 64-bit divisor):
Apple Silicon M4, 2.9–4.4 GHz:

             div      div128    div32     div32w2
clang O3     4 ns     9 ns      39 ns     27 ns
gcc O3       9 ns     12 ns     40 ns     22 ns
Here, “div” is my 2-by-1 division function; “div128” uses uint128_t on GCC and a somewhat reduced version of the LLVM sources on Clang, as those are somewhat faster; “div32” applies the classical algorithm strictly to 32-bit words; and “div32w2” is the same algorithm with an optimization for a two-word divisor in the style of “div”.
The “div” and “div128” functions are faster because they are 64/32-bit hybrids, so calculations take place in 64-bit words where possible, whereas “div32” and “div32w2” work strictly within 32-bit words, with overflow and underflow carried separately. These 32-bit versions are only used for comparison: it is better to use the 64-bit versions for efficiency. There is no hardware support for 128-bit words, so there seems to be no point in a 128/64-bit word hybrid.
On Intel, the performance is not good, not even when using “divq” directly:
Intel Coffee Lake, 2.4–4.1 GHz:

             div      divq
clang O3     28 ns    25 ns
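“divq” here refers to the x86-64 instruction that divides RDX:RAX by a 64-bit operand; a minimal sketch of invoking it directly (GCC/Clang inline assembly, illustrative rather than the actual benchmark code):

#include <cstdint>

// Divide the 128-bit value (hi:lo) by d using the hardware instruction.
// divq takes the dividend in RDX:RAX and leaves the quotient in RAX and
// the remainder in RDX; it raises #DE if the quotient does not fit, so
// the caller must guarantee hi < d.
inline std::uint64_t div_divq(std::uint64_t hi, std::uint64_t lo,
                              std::uint64_t d, std::uint64_t& rem)
{
    std::uint64_t q, r;
    __asm__("divq %[d]"
            : "=a"(q), "=d"(r)
            : [d] "r"(d), "a"(lo), "d"(hi)
            : "cc");
    rem = r;
    return q;
}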
Received on 2025-08-07 13:43:47