You have been extremely insistent that what LLVM does is slow, in part because it has to do a binary division for multi-word-by-multi-word division. However, I have now looked at https://github.com/llvm/llvm-project/blob/3424447645c0ae09cc97fc59fc0f2bd383a67ed1/compiler-rt/lib/builtins/udivmodti4.c#L113-L120. There, LLVM does a div_wide kind of operation when the quotient fits into the result, and I imagine that the surrounding code doesn't add too much cost.
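To make concrete what I mean by a div_wide kind of operation, here is a rough sketch of a 128-by-64-bit division whose quotient fits in 64 bits. This is not LLVM's code: the name `div_wide`, the result struct, and the fallback path are my own assumptions for illustration. On x86-64 the whole thing is a single `divq`, which is only valid when `hi < d`.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} div_wide_result;

/* Hypothetical helper: divide the 128-bit value (hi:lo) by d.
 * Precondition: hi < d, which guarantees the quotient fits in 64 bits
 * (and implies d != 0). */
static div_wide_result div_wide(uint64_t hi, uint64_t lo, uint64_t d) {
    div_wide_result r;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* divq divides RDX:RAX by the operand; quotient -> RAX, remainder -> RDX. */
    uint64_t a = lo, b = hi;
    __asm__("divq %[d]" : "+a"(a), "+d"(b) : [d] "r"(d) : "cc");
    r.quot = a;
    r.rem = b;
#else
    /* Portable fallback, assuming the compiler provides unsigned __int128. */
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    r.quot = (uint64_t)(n / d);
    r.rem = (uint64_t)(n % d);
#endif
    return r;
}
```

If the precondition `hi < d` is violated, `divq` raises a divide error on x86-64, so a caller has to check for that case first and fall back to a more general path.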

You should benchmark to see whether your int_mod implementation is actually faster than _BitInt when possible; judging by the code you've shared, there may not be a significant difference.
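A minimal harness for such a comparison could look like the sketch below. Everything in it is an assumption on my part: `mod_candidate` is a placeholder where your int_mod would go, and the baseline uses `unsigned __int128` as a stand-in for `_BitInt(128)`, since I don't know which compiler and C version you are targeting. The checksum keeps the compiler from discarding the work and doubles as a correctness check between the two implementations.

```c
#include <stdint.h>
#include <time.h>

typedef uint64_t (*mod_fn)(uint64_t hi, uint64_t lo, uint64_t m);

/* Baseline: let the compiler lower a 128-bit remainder itself. */
static uint64_t mod_builtin(uint64_t hi, uint64_t lo, uint64_t m) {
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    return (uint64_t)(n % m);
}

/* Placeholder for the implementation under test; replace the body
 * with a call to int_mod. As written it just forwards to the baseline. */
static uint64_t mod_candidate(uint64_t hi, uint64_t lo, uint64_t m) {
    return mod_builtin(hi, lo, m);
}

/* Small deterministic PRNG so both runs see identical inputs. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Runs f over iters pseudo-random inputs; returns elapsed CPU seconds
 * and stores an XOR of all results in *checksum. */
static double bench(mod_fn f, uint64_t iters, uint64_t *checksum) {
    uint64_t state = 0x9E3779B97F4A7C15ull;
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t hi = xorshift64(&state);
        uint64_t lo = xorshift64(&state);
        uint64_t m = xorshift64(&state) | 1; /* force a nonzero modulus */
        acc ^= f(hi, lo, m);
    }
    clock_t t1 = clock();
    *checksum = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `bench` once with each function and the same iteration count, check that the two checksums match, and only then compare the timings. Build with optimizations on, and repeat the runs a few times, since `clock()` is coarse and the first run pays for warm-up.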