Date: Tue, 13 Jan 2026 13:30:31 -0800
On Tuesday, 13 January 2026 13:20:54 Pacific Standard Time Hans Åberg via Std-
Proposals wrote:
> > Who said that compilers need a hand-rolled sequence of operations for each
> > type? If you can create any algorithm using templates, then the compiler
> > can do the SAME algorithm at a fraction of the cost,
> > an order of magnitude faster than any template or constexpr operation, as
> > it will run on raw x86 that will select a sequence of x86 operations.
> > And first of all it will be blazing fast on `-O0` instead of the slog
> > of dozens of superficial functions required by template abstraction.
>
> There is a big difference between CPUs with the same compiler, Clang. The
> same 2/5 GHz clock frequency took only about 4 ns per 2-by-1 division on an
> Apple Silicon, whereas on an older Intel from 2019 some 30–40 ns. The
> difference seems to be in the pipelining: the instructions must be computed
> in parallel for this low latency, and the written code must be structured
> to admit this.

You're not answering the questions.
In fact, the answer you gave has absolutely nothing to do with Marcin's
statement. Marcin said that if the compiler implements support for 256-bit
integers, then it won't need to expand and instantiate templates, which in -O0
mode means there are no non-inlined inline functions.
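
To make that concrete, here is a minimal sketch using 128-bit integers as a stand-in, since `unsigned __int128` is a GCC/Clang extension that exists today and a built-in 256-bit type is only hypothetical here. `U128`, its `operator+`, `builtin_add` and `sums_agree` are illustrative names, not from any real library. The library-style `operator+` is an ordinary inline function that the compiler usually leaves as a real call at `-O0`, while the built-in addition is expanded directly by the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical library-style 128-bit unsigned integer made of two 64-bit limbs.
struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Library approach: an ordinary inline function. At -O0 the compiler
// typically emits an out-of-line copy and calls it for every addition.
inline U128 operator+(U128 a, U128 b) {
    U128 r;
    r.lo = a.lo + b.lo;                              // wraps modulo 2^64
    std::uint64_t carry = (r.lo < a.lo) ? 1u : 0u;   // detect the wrap
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Built-in approach: the compiler knows the type (GCC/Clang extension)
// and lowers the addition itself; no operator function is involved.
__extension__ typedef unsigned __int128 builtin_u128;

inline U128 builtin_add(U128 a, U128 b) {
    builtin_u128 x = (static_cast<builtin_u128>(a.hi) << 64) | a.lo;
    builtin_u128 y = (static_cast<builtin_u128>(b.hi) << 64) | b.lo;
    builtin_u128 s = x + y;                          // wraps modulo 2^128
    U128 r;
    r.lo = static_cast<std::uint64_t>(s);
    r.hi = static_cast<std::uint64_t>(s >> 64);
    return r;
}

// Helper for checking that both approaches compute the same value.
inline bool sums_agree(U128 a, U128 b) {
    U128 lib = a + b;
    U128 blt = builtin_add(a, b);
    assert(lib.lo == blt.lo);
    return lib.lo == blt.lo && lib.hi == blt.hi;
}
```

Comparing the assembly for `a + b` and for the `x + y` line at `-O0` should show the difference in question: the former goes through a call to the operator, the latter does not. The exact output depends on the compiler and target.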

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Data Center - Platform & Sys. Eng.
Received on 2026-01-13 21:30:34
