ISOCPP std-proposals List: Re: [std-proposals] Modular integers

From: (wrong string) Åberg <haberg_1_at_[hidden]>
Date: Tue, 13 Jan 2026 17:58:57 +0100

> On 13 Jan 2026, at 16:08, Jan Schultke <janschultke_at_[hidden]> wrote:
>
>> > Why would templates be able to be faster than _BitInt()? Is there any
>> > inherent issue in _BitInt() that doesn't allow compilers to implement it
>> > efficiently?
>>
>> The templates are recursive, so they halve the words until one can replace it with hardware-supported assembler. The template functions have no backward jumps, encouraging pipelining, which is very complicated to write by hand if one were to iterate it.
>>
>> Say it is on a 64-bit platform. For division, if there is divq, one can use it; if not, it will recurse down to a 32-bit implementation. If one is on a larger word size, one can use it by specializing an inline function.
>>
> None of these seem like "inherent issues". Everything you're describing boils down to the quality of lowering say, a 128-bit operation to the underlying 64-bit or 32-bit operations supported by hardware. That is, purely non-observable performance characteristics of implementation details.

This is a reply to a question about implementation.

> Compilers already do a reasonably good job at lowering _BitInt(128) and _BitInt(4096) operations, and there is no innate reason why they would do a better job at _BitModInt(4096) or mod_int<4096> or whatever you're proposing. I don't see any scenario in which your proposed mod_int wouldn't just be an alias template for unsigned _BitInt.

It is important for pipelining to avoid backward jumps, like in loops. Despite Clang having an efficient 2-word by 1-word 64-bit implementation, its 2-word by 2-word division uses binary division, which is slow. In addition, 2-word by 1-word division can be optimized to only use one multiplication per half-word.

Received on 2026-01-13 16:59:14