Date: Tue, 13 Jan 2026 22:13:52 +0100
> On 13 Jan 2026, at 21:26, Jan Schultke <janschultke_at_[hidden]> wrote:
>
>> > If recursive template-based implementations can generate better code,
>> > then the compiler could just skip the recursive template nonsense and
>> > just generate the better code. That recursive templates can be used to
>> > trick the compiler into that better code generation doesn't stop this
>> > from being a QoI issue.
>>
>> The recursive templates can't easily be omitted because the iteration is too complex to write by hand, and part of the optimization happens in the pipelining, which is at a lower level.
>>
> There is still no reason why the target-specific instruction selection for iN operations in LLVM necessarily selects worse instructions for, say, 128-bit division than division between mod_int. CPU pipelining is below assembly anyway, so somewhat beyond what compilers can directly control.
LLVM's runtime support routines are written in C, and for 128 bits they already have the standard 2-by-1 64-bit word algorithm, though the 2-by-2 case falls back to binary (shift-and-subtract) division. If one wants the former for 256 bits, the code must somehow be iterated, and using the standard multiprecision division algorithm would introduce a loop, breaking the pipelining.
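The loop in question can be sketched as follows. This is a generic illustration of N-word-by-1-word long division (the function name is hypothetical, not LLVM's actual code), using the GCC/Clang __uint128_t extension; note that each iteration's remainder feeds the next, so the loop is a serial dependency chain:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch (hypothetical name) of the multiprecision loop described above:
   N-word-by-1-word division as a chain of 2-by-1 steps. The remainder of
   each step becomes the high word of the next dividend, so iterations
   cannot overlap in the CPU pipeline. __uint128_t is a GCC/Clang extension.
   The dividend n[] and quotient q[] are little-endian arrays of 64-bit words. */
static uint64_t divmod_nby1(uint64_t q[], const uint64_t n[], size_t words,
                            uint64_t d)
{
    uint64_t rem = 0;
    for (size_t i = words; i-- > 0; ) {        /* most significant word first */
        __uint128_t cur = ((__uint128_t)rem << 64) | n[i];
        q[i] = (uint64_t)(cur / d);            /* one 2-by-1 (divq-type) step */
        rem  = (uint64_t)(cur % d);            /* feeds the next iteration */
    }
    return rem;
}
```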
Also, mod_int<m> would work for any modulus: the value should simply be padded to an efficient type that provides a full 2-word multiplication result and 2-by-1 (divq-type) division. Multiprecision algorithms are complicated.
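As a sketch of that 2-by-1 building block (the name is hypothetical; this assumes x86-64, where the pattern maps onto a single divq, and uses the GCC/Clang __uint128_t extension):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sketch of the 2-by-1 step mentioned above: divide a
   two-word (128-bit) dividend by a one-word (64-bit) divisor, under the
   precondition hi < d, which guarantees the quotient fits in one word.
   On x86-64 this is exactly what a single divq instruction computes.
   __uint128_t is a GCC/Clang extension. */
static uint64_t div_2by1(uint64_t hi, uint64_t lo, uint64_t d, uint64_t *rem)
{
    assert(hi < d);                      /* precondition: no quotient overflow */
    __uint128_t n = ((__uint128_t)hi << 64) | lo;
    *rem = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}
```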
>> > If you want this to move forward, you need to show:
>> >
>> > 1. That better code generation exists. This is not about the
>> > *mechanism* of that code generation (ie: recursive templates); I'm
>> > talking about the actual resulting assembly.
>> > 2. That a compiler is incapable of generating that code using the
>> > existing `_BitInt` mechanism. Not that compilers don't currently do
>> > it; that a compiler *fundamentally* cannot do so through `_BitInt`.
>> > That there is some information which cannot exist that prevents it
>> > from generating the optimal assembly.
>>
>> It relies on a mathematical analysis which can't be done in the general multiprecision division algorithm.
>>
> But LLVM does not perform multiprecision lowering until instruction selection in the backend. It does mathematical analysis in the middle-end for iN as if that integer width was supported. That is, dividing by 2**100 should get strength-reduced to a 100-bit right-shift instead of first being broken down into countless 64-bit instructions, which would prevent this strength reduction.
In cryptography, one may use numbers like the one in this example, for which there are no obvious reductions:
https://en.wikipedia.org/wiki/LCS35
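For reference, the strength reduction from the quoted text can be sketched like this (a minimal illustration, not LLVM's actual pass, with the GCC/Clang __uint128_t extension standing in for a wider _BitInt):

```c
#include <stdint.h>

/* Minimal illustration of the strength reduction discussed above, with
   __uint128_t (a GCC/Clang extension) standing in for a wider type:
   division by 2**100 is exactly a 100-bit right shift, so a middle end
   that sees the operation at full width can rewrite it directly instead
   of emitting a multiword division. */
static __uint128_t div_by_2_pow_100(__uint128_t x)
{
    return x / ((__uint128_t)1 << 100);
}

static __uint128_t shift_by_100(__uint128_t x)
{
    return x >> 100;   /* the strength-reduced form */
}
```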
> The fact that mathematical analysis can't be done when you break it down into multiprecision is actually why it's a terrible idea to use pure library implementations instead of fundamental types for N-bit integers: emitting 100 64-bit IR instructions instead of a single 128-bit IR instruction obfuscates the mathematics of the operation to the compiler. This is why we need _BitInt.
There are special algorithms for division by one or two words, and further ones when the dividend is two words, regardless of the level at which you intend to invoke them.
Received on 2026-01-13 21:14:09
