Date: Thu, 1 Aug 2019 13:20:55 +0100
On 01/08/2019 13:12, Tjernstrom, Staffan via SG14 wrote:
> Hey Matt,
>
> I'm dubious, but not staunchly against.
>
> My anecdotal (no current measurements) reason being that I've seen naive code outperform SIMD code for short memory arenas (say < 10 cache lines), presumably due to the down-clocking effect of the later SIMD instruction sets.
Also, Haswell and later have become remarkably good at rewriting short
runs of scalar code to perform exactly as if they were written using
SIMD. About 200-300 instructions, I've generally found. If you're
looking at a loop of fewer than 200 instructions, and you can guarantee a
newer CPU, chances are high that a SIMD rewrite will confer negligible
gains.
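
A rough sketch of how one might measure that before committing to a SIMD rewrite: time the plain scalar loop over a small arena first, then compare any SIMD variant against it on the target CPU. Everything named here (`make_small_arena`, `scalar_sum`, `ns_per_call`) is hypothetical, the 64-byte cache line and the 8-line arena are assumptions chosen to sit under the "< 10 cache lines" figure quoted above, and a serious measurement would want a proper benchmarking harness rather than this.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Assumption: 64-byte cache lines, so 8 lines hold 128 uint32_t values.
inline std::vector<std::uint32_t> make_small_arena() {
    constexpr std::size_t cache_line_bytes = 64;
    constexpr std::size_t lines = 8;
    std::vector<std::uint32_t> arena(lines * cache_line_bytes / sizeof(std::uint32_t));
    std::iota(arena.begin(), arena.end(), 1u);  // fill with 1, 2, ..., 128
    return arena;
}

// Plain scalar reduction. The compiler may or may not auto-vectorise this;
// the point is to measure it as written before hand-writing SIMD.
inline std::uint64_t scalar_sum(const std::uint32_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Average wall-clock nanoseconds per call of f() over `reps` calls.
// The volatile sink stops the optimiser from discarding the calls.
template <class F>
double ns_per_call(F&& f, int reps) {
    assert(reps > 0);
    volatile std::uint64_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        sink = f();
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Passing a hand-written SIMD version of the same reduction to `ns_per_call` gives the comparison; the numbers will depend heavily on the CPU, compiler flags, and whether the arena is already in cache.
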
Where scatter-gather SIMD really shines is on ARM NEON, but then NEON has
actually useful scatter-gather. One can often avoid whole memory copies
using its support. I wish Intel would do the same.
(Some tell me AVX-512 does now have this, but I haven't investigated)
Niall
Received on 2019-08-01 07:22:59