ISOCPP std-proposals List: Re: [std-proposals] Modular integers

From: Hans Åberg <haberg_1_at_[hidden]>
Date: Tue, 13 Jan 2026 22:20:54 +0100

> On 13 Jan 2026, at 22:03, Marcin Jaczewski <marcinjaczewski86_at_[hidden]> wrote:
>
> On Tue, 13 Jan 2026 at 21:05, Hans Åberg via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
>>
>>
>>> On 13 Jan 2026, at 20:31, Jason McKesson via Std-Proposals <std-proposals_at_[hidden]> wrote:
>>>
>>> On Tue, Jan 13, 2026 at 1:12 PM Hans Åberg via Std-Proposals
>>> <std-proposals_at_[hidden]> wrote:
>>>> One way to optimize _BitInt(N) for time might be to find a word of 2^(2^k) bits that it fits into, and then use recursive templates halving the words. Then the problem is this requires C++, not available in C. So it goes back to the problem that this is a C type that inherits its limitations. But then, there would be no difference in performance between these types.
>>>
>>> If recursive template-based implementations can generate better code,
>>> then the compiler could just skip the recursive template nonsense and
>>> just generate the better code. That recursive templates can be used to
>>> trick the compiler into that better code generation doesn't stop this
>>> from being a QoI issue.
>>
>> The recursive templates can't easily be omitted because the iteration is too complex to be written by hand, and part of the optimization happens in the pipelining, which is at a lower level.
>
> Who said that compilers need a hand-rolled sequence of operations for each type?
> If you can create any algorithm using templates, then the compiler can
> do the SAME algorithm at a fraction of the cost,
> an order of magnitude faster than any template or constexpr operation, as
> it will run on raw x86 and select a sequence of x86 operations.
> And above all, it will be blazing fast at `-O0`, instead of the slog
> of dozens of superficial functions required by template abstraction.

There is a big difference between CPUs with the same compiler, Clang. At the same 2.5 GHz clock frequency, a 2-by-1 division took only about 4 ns on Apple Silicon, whereas on an older Intel CPU from 2019 it took some 30–40 ns. The difference seems to be in the pipelining: the instructions must be executed in parallel to reach this low latency, and the written code must be structured to admit this.
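
As a rough illustration of what "structured to admit this" can mean, here is a minimal C++ sketch. It is not the code that was measured above. It assumes 32-bit words so that the double word fits in a native uint64_t, and all names (QuoRem, div_2by1, divide_words, divide_words_pair) are hypothetical. The first loop is a single dependency chain, since each 2-by-1 division needs the remainder of the previous one; the second loop carries two unrelated chains that a pipelined, out-of-order core may be able to overlap.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; 32-bit words are an assumption made so
// that the two-word dividend fits in a native uint64_t.
struct QuoRem {
    std::uint32_t quo;
    std::uint32_t rem;
};

// 2-by-1 division: the two-word value (hi:lo) divided by the word d.
// Precondition: hi < d, which guarantees the quotient fits in one word.
inline QuoRem div_2by1(std::uint32_t hi, std::uint32_t lo, std::uint32_t d) {
    assert(hi < d);
    const std::uint64_t n = (std::uint64_t{hi} << 32) | lo;
    return {static_cast<std::uint32_t>(n / d),
            static_cast<std::uint32_t>(n % d)};
}

// Long division of one N-word number (most significant word first) by d.
// Every step consumes the remainder of the previous step, so the
// divisions form one dependency chain and cannot overlap.
template <std::size_t N>
std::uint32_t divide_words(std::array<std::uint32_t, N>& a, std::uint32_t d) {
    std::uint32_t rem = 0;
    for (std::uint32_t& w : a) {
        const QuoRem qr = div_2by1(rem, w, d);
        w = qr.quo;
        rem = qr.rem;
    }
    return rem;
}

// Two unrelated numbers divided in the same loop. The two chains share
// no data, so the hardware is free to run them side by side.
template <std::size_t N>
std::array<std::uint32_t, 2> divide_words_pair(std::array<std::uint32_t, N>& a,
                                               std::array<std::uint32_t, N>& b,
                                               std::uint32_t d) {
    std::uint32_t ra = 0, rb = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const QuoRem qa = div_2by1(ra, a[i], d);
        const QuoRem qb = div_2by1(rb, b[i], d);
        a[i] = qa.quo;
        ra = qa.rem;
        b[i] = qb.quo;
        rb = qb.rem;
    }
    return {ra, rb};
}
```

Both loops compute the same results per number; only the amount of independent work visible to the CPU differs. Whether that translates into a timing difference depends on the divider hardware and on what the compiler emits, so it has to be measured on each machine.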

Received on 2026-01-13 21:21:14