> Compilers already do a reasonably good job at lowering _BitInt(128) and _BitInt(4096) operations, and there is no innate reason why they would do a better job at _BitModInt(4096) or mod_int<4096> or whatever you're proposing. I don't see any scenario in which your proposed mod_int wouldn't just be an alias template for unsigned _BitInt.

Avoiding backward jumps, such as those in loops, is important for pipelining. Although Clang has an efficient 64-bit 2-word by 1-word division, its 2-word by 2-word division falls back to binary division, which is slow. In addition, 2-word by 1-word division can be optimized to use only one multiplication per half-word.
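
To make "binary division" concrete, here is a minimal sketch of shift-and-subtract division on two-word operands. It is an illustration only, not Clang's actual runtime code; the names `u128`, `DivResult`, and `binary_divide` are made up for this example, and `unsigned __int128` is a GCC/Clang extension rather than standard C++. The point is that the loop produces one quotient bit per iteration, so a full division takes 128 trips through a backward jump.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the compiler provides unsigned __int128 (GCC and Clang do).
using u128 = unsigned __int128;

// Hypothetical result type for this sketch.
struct DivResult {
    u128 quot;
    u128 rem;
};

// Restoring shift-and-subtract division: one quotient bit per iteration.
DivResult binary_divide(u128 n, u128 d) {
    assert(d != 0);
    u128 q = 0;
    u128 r = 0;
    for (int i = 127; i >= 0; --i) {
        // The bit shifted out of r is a carry: the true remainder is
        // 2^128 + r in that case, which is always >= d.
        const bool carry = static_cast<bool>(r >> 127);
        r = (r << 1) | ((n >> i) & 1);
        if (carry || r >= d) {
            r -= d;  // wraps modulo 2^128, which gives the right value
            q |= static_cast<u128>(1) << i;
        }
    }
    return {q, r};
}
```

A word-based algorithm (for example, Knuth's Algorithm D) replaces those 128 iterations with a handful of hardware divide and multiply instructions, which is why the difference matters.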

That sounds like a missed optimization. You could file a bug report for LLVM or fix the issue yourself in a PR. Why would you expect a mod_int<128> to use a more efficient division than _BitInt(128)?