> Why would templates be able to be faster than _BitInt()? Is there any
> inherent issue in _BitInt() that doesn't allow compilers to implement it
> efficiently?

The templates are recursive, so they halve the words until one can replace it with hardware-supported assembler. The template functions have no backward jumps, encouraging pipelining, which is very complicated to write by hand if one were to iterate it.

Say it is on a 64-bit platform. For division, if there is divq, one can use it; if not, it will recurse down to a 32-bit implementation. If one is on a larger word size, one can use it by specializing an inline function.

None of these seem like "inherent issues". Everything you're describing boils down to the quality of lowering say, a 128-bit operation to the underlying 64-bit or 32-bit operations supported by hardware. That is, purely non-observable performance characteristics of implementation details.

Compilers already do a reasonably good job at lowering _BitInt(128) and _BitInt(4096) operations, and there is no innate reason why they would do a better job at _BitModInt(4096) or mod_int<4096> or whatever you're proposing. I don't see any scenario in which your proposed mod_int wouldn't just be an alias template for unsigned _BitInt.