Date: Tue, 13 Jan 2026 16:08:37 +0100
>
> > Why would templates be able to be faster than _BitInt()? Is there any
> > inherent issue in _BitInt() that doesn't allow compilers to implement it
> > efficiently?
>
> The templates are recursive, so they halve the words until one can replace
> it with hardware-supported assembler. The template functions have no
> backward jumps, encouraging pipelining, which is very complicated to write
> by hand if one were to iterate it.
>
> Say it is on a 64-bit platform. For division, if there is divq, one can
> use it; if not, it will recurse down to a 32-bit implementation. If one is
> on a larger word size, one can use it by specializing an inline function.
>
None of these seem like "inherent issues". Everything you're describing
boils down to the quality of lowering, say, a 128-bit operation to the
underlying 64-bit or 32-bit operations supported by hardware. That is,
these are purely non-observable performance characteristics of
implementation details.
Compilers already do a reasonably good job at lowering _BitInt(128) and
_BitInt(4096) operations, and there is no innate reason why they would do a
better job at _BitModInt(4096) or mod_int<4096> or whatever you're
proposing. I don't see any scenario in which your proposed mod_int wouldn't
just be an alias template for unsigned _BitInt.
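
To make "lowering" concrete, here is a minimal sketch of a 128-bit unsigned
addition written using only 64-bit operations. The names u128 and add are
hypothetical, made up for this illustration; they come from neither the
proposal under discussion nor any compiler. The sketch assumes C++14 or
later for the constexpr function body.

```cpp
#include <cstdint>

// Hypothetical two-limb 128-bit unsigned integer, for illustration only.
struct u128 {
    std::uint64_t lo;  // low 64 bits
    std::uint64_t hi;  // high 64 bits
};

// Add two 128-bit values using only 64-bit operations.
// The result wraps modulo 2^128, as unsigned arithmetic does.
constexpr u128 add(u128 a, u128 b) {
    u128 r{};
    r.lo = a.lo + b.lo;
    // Unsigned overflow of the low limb shows up as a wrapped (smaller) sum.
    std::uint64_t carry = (r.lo < a.lo) ? 1u : 0u;
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Compile-time checks: carry into the high limb, and wrap-around at 2^128.
static_assert(add(u128{UINT64_MAX, 0}, u128{1, 0}).hi == 1, "carry propagates");
static_assert(add(u128{UINT64_MAX, UINT64_MAX}, u128{1, 0}).hi == 0, "wraps");
```

On a 64-bit target, a compiler would typically turn an addition of two
unsigned _BitInt(128) values into this same add-with-carry pattern, and a
recursive template that halves the operand ends up describing the same
sequence. Which of the two produces better code is a property of the
implementation, not of the type.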
Received on 2026-01-13 15:08:51
