Date: Tue, 19 Jan 2021 08:50:14 -0800
On Monday, 18 January 2021 08:59:09 PST Wesley Oliver via Std-Proposals wrote:
> Only way I see that happen in the future is if a processor can execute two
> instructions simultaneous, involving different registers.
They do that. Intel processors for the past 10 years have been able to execute
up to 5 instructions (micro-ops) per cycle. That's the theoretical maximum and
won't happen all the time, but it's fairly common to see code execute 2 or 3
instructions per cycle. I believe this is common on AArch64 systems too but I
don't have direct experience.
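A minimal sketch of what that means for C++ code, assuming nothing beyond the standard library (the function names are hypothetical, chosen only for illustration). The first function forms a single dependency chain, so each addition has to wait for the previous one. The second keeps two independent accumulators, which gives a superscalar core two additions it can issue in the same cycle. Whether that shows up as a measurable speedup depends on the compiler (which may unroll or vectorize either version) and on the microarchitecture.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Single dependency chain: every add needs the result of the previous add.
std::uint64_t sum_single_chain(const std::vector<std::uint32_t>& v) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Two independent chains: the adds into `even` and `odd` do not depend on
// each other, so the hardware is free to execute them in parallel.
std::uint64_t sum_two_chains(const std::vector<std::uint32_t>& v) {
    std::uint64_t even = 0;
    std::uint64_t odd = 0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        even += v[i];
        odd += v[i + 1];
    }
    if (i < v.size())  // leftover element when the size is odd
        even += v[i];
    return even + odd;
}
```

Both functions return the same value for the same input; only the shape of the dependency graph differs.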
This is especially true of loop-overhead instructions, since more often than
not they don't depend on the data itself. Range checking certainly qualifies
for this. In fact, because the processor is pipelined, it will have
determined whether the loop terminates 10 to 100 cycles before the
instructions doing the actual work reach that point.
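To make the loop-overhead point concrete, here is a small sketch of a range-checked loop (the function name `checked_sum` and its parameters are hypothetical). The increment, the loop condition, and the range check read only `i`, `count`, and `data.size()`; none of them reads a value loaded from the array. That is why the hardware can resolve them well ahead of the loads and dependent additions that make up the real work.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sums the first `count` elements of `data`, checking every index.
double checked_sum(const std::vector<double>& data, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {   // overhead: depends only on i and count
        if (i >= data.size())                   // range check: depends only on i and the size
            throw std::out_of_range("index past end of data");
        total += data[i];                       // actual work: a load plus a dependent add
    }
    return total;
}
```

The check still costs instruction slots, but it sits off the critical path formed by the chain of additions into `total`.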
And this is even more true of short loops like yours because the entire work
fits in the CPU structures designed to detect loops and execute them REALLY
fast. See https://www.anandtech.com/show/2594/4 (and note how this article is
from 2008!).
So please study the state of the art first. Your initiative to improve things
is commendable, but if you propose things that were done 15 years ago, you'll
be wasting your time.
[For the nitpickers: yes, I know the LSD (Loop Stream Detector) was turned off
in some microarchitectures because it had problems. That doesn't invalidate
the point.]
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Software Architect - Intel DPG Cloud Engineering
Received on 2021-01-19 10:50:20