Date: Tue, 13 Jan 2026 22:53:58 +0100
> On 13 Jan 2026, at 22:30, Thiago Macieira via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> On Tuesday, 13 January 2026 13:20:54 Pacific Standard Time Hans Åberg via Std-
> Proposals wrote:
>>> Who said that compilers need a hand-rolled sequence of operations for
>>> each type?? If you can create any algorithm using templates, then the
>>> compiler can do the SAME algorithm at a fraction of the cost,
>>> an order of magnitude faster than any template or constexpr operation,
>>> as it will compile down to a selected sequence of raw x86 operations.
>>> And first of all it will be blazing fast on `-O0`, instead of the slog
>>> of dozens of superficial functions required by template abstraction.
>>
>> There is a big difference between CPUs using the same compiler, Clang, at
>> about the same clock frequency, 2/5 GHz: only about 4 ns per 2-by-1
>> division on Apple Silicon, whereas on an older Intel from 2019 it is some
>> 30–40 ns. The difference seems to be in the pipelining: the instructions
>> must be executed in parallel to reach this low latency, and the written
>> code must be structured to admit this.
>
> You're not answering the questions.
Oh. Too many replies. :-)
> In fact, the answer you gave has absolutely nothing to do with Marcin's
> statement. Marcin said that if the compiler implements support for 256-bit
> integers, then it won't need to expand and instantiate templates, which in -O0
> mode means there are no non-inlined inline functions.
But that is the very point: the templates recurse down through half-words until an implementation is found. So in your example, it will just call the specialized 256-bit function that is implemented.
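
To make the dispatch concrete, here is a minimal sketch, not the actual library code: the names wide_uint and add_with_carry are made up for illustration, and addition stands in for the 2-by-1 division above because it is short. The point is only the shape of the recursion: the generic template splits into half-width limbs, and any width that has a non-template implementation (here the native 64-bit word; in your scenario a compiler-provided 256-bit routine) is picked by overload resolution, so no further layers are instantiated for that width.

#include <cstdint>
#include <cstddef>
#include <iostream>

template <std::size_t Bits>
struct wide_uint {
    wide_uint<Bits / 2> lo, hi;      // value = hi * 2^(Bits/2) + lo
};

template <>
struct wide_uint<64> {               // recursion bottoms out at a native word
    std::uint64_t v;
};

// "Implemented" width: a non-template overload that overload resolution
// prefers, standing in for a hand-written or compiler-provided routine.
inline bool add_with_carry(const wide_uint<64>& a, const wide_uint<64>& b,
                           bool carry_in, wide_uint<64>& out) {
    std::uint64_t s = a.v + b.v;
    bool c = s < a.v;                        // carry out of a + b
    out.v = s + (carry_in ? 1u : 0u);
    return c || (carry_in && out.v == 0);    // adding the carry wrapped around
}

// Generic width: recurses down half-words until an implemented width is hit.
template <std::size_t Bits>
bool add_with_carry(const wide_uint<Bits>& a, const wide_uint<Bits>& b,
                    bool carry_in, wide_uint<Bits>& out) {
    bool c = add_with_carry(a.lo, b.lo, carry_in, out.lo);
    return add_with_carry(a.hi, b.hi, c, out.hi);
}

int main() {
    wide_uint<256> a{}, b{}, r{};
    a.lo.lo.v = ~0ull;               // force a carry out of the lowest limb
    b.lo.lo.v = 1;
    add_with_carry(a, b, false, r);
    std::cout << r.lo.hi.v << '\n';  // prints 1: the carry propagated up
}

Division is messier than addition, of course, but the dispatch has the same shape: give the compiler (or the library) a real 256-bit routine and the generic layers are never instantiated for that width.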
Received on 2026-01-13 21:54:16
