Yes, that is how I discovered it was much faster. It is a template, essentially an implementation of div_wide in this proposal:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3161r4.html#functions.div_wide

LLVM implements it with a loop. I made recursive templates, removing the loop and all multiplications except one per halfword.

Tiago Freire has a copy, which might serve as a reference implementation in case you would like to require that compilers provide it rather than leaving it optional.

If you have an optimization opportunity that LLVM does not take, why don't you make an LLVM PR or bug report instead of a C++ proposal?

> And if so, why can't the compiler do
> the same? What is the reason it cannot do the same?

Why don't you ask the compiler or compiler writer? :-)

Why don't we ask you? You're the one arguing there needs to be a mod_int C++ feature, so you're the one who needs to motivate it.

In order to remove the loop, one has to restructure the condition. For a full-word implementation, one has to use two's complement features. In addition, one uses the mathematical fact that the preliminary division overshoots by at most 2, so that the add-back step is not needed; this can be further exploited to eliminate a final multiplication.
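
To make the "overshoots by at most 2" point concrete, here is a minimal sketch. It is not the template discussed in this thread: the names div_wide32, digit_step and DivResult are made up for illustration, the word size is shrunk to 32 bits with 16-bit halfwords, and uint64_t is used as scratch space. A true full-word version has no wider type to lean on, which is presumably where the two's complement handling comes in. Once the divisor is normalized, the halfword quotient estimate is never too small and is too large by at most 2, so two unrolled corrections can stand in for the correction loop, with one wide multiplication per halfword.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct DivResult {
    std::uint32_t quot;
    std::uint32_t rem;
};

// One halfword step: divide the three-halfword value (hi : lo16) by v.
// Preconditions: v has its top bit set (normalized) and hi < v.
// Returns the 16-bit quotient digit and stores the remainder (< v) in rem.
inline std::uint32_t digit_step(std::uint32_t hi, std::uint32_t lo16,
                                std::uint32_t v, std::uint32_t& rem) {
    const std::uint32_t v1 = v >> 16;        // leading halfword of the divisor
    std::uint32_t qhat = hi / v1;            // preliminary quotient digit
    if (qhat > 0xFFFFu) qhat = 0xFFFFu;      // clamp to a single halfword

    const std::uint64_t t = (std::uint64_t{hi} << 16) | lo16;
    std::uint64_t p = std::uint64_t{qhat} * v;   // the one multiplication

    // With v normalized, qhat exceeds the true digit by at most 2,
    // so two straight-line corrections are enough; no loop is needed.
    if (p > t) { --qhat; p -= v; }
    if (p > t) { --qhat; p -= v; }

    rem = static_cast<std::uint32_t>(t - p);
    return qhat;
}

// Divide the two-word value (hi : lo) by d. Requires d != 0 and hi < d,
// so that the quotient fits in one word.
inline DivResult div_wide32(std::uint32_t hi, std::uint32_t lo, std::uint32_t d) {
    assert(d != 0 && hi < d);

    // Normalize: shift d left until its top bit is set. This small loop only
    // counts leading zeros; it is not the quotient-correction loop.
    int s = 0;
    while ((d & 0x80000000u) == 0) { d <<= 1; ++s; }

    // Shift the dividend by the same amount; it still fits in 64 bits
    // because hi < d before the shift.
    const std::uint64_t n = ((std::uint64_t{hi} << 32) | lo) << s;
    const std::uint32_t n_hi = static_cast<std::uint32_t>(n >> 32);
    const std::uint32_t n_lo = static_cast<std::uint32_t>(n);

    std::uint32_t r1 = 0, r0 = 0;
    const std::uint32_t q1 = digit_step(n_hi, n_lo >> 16, d, r1);
    const std::uint32_t q0 = digit_step(r1, n_lo & 0xFFFFu, d, r0);

    // Undo the normalization shift on the remainder.
    return { (q1 << 16) | q0, r0 >> s };
}
```

For example, div_wide32(1, 0, 3) divides 2^32 by 3 and yields quotient 1431655765 with remainder 1. The 32-bit word size was chosen only so that the result can be checked against ordinary 64-bit division.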

And why is it innately impossible for LLVM to perform this two's-complement-based and mathematical optimization for unsigned _BitInt(128), unlike for mod_int<128>?