Date: Fri, 2 Aug 2019 09:50:02 +1200
Thanks Staffan & Niall,
my understanding is that gather-scatter has greatly improved in AVX512
skylake and onwards, see the following links for details:
https://i.stack.imgur.com/s4OOt.png
https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented
but I take your point that Actual benefits may be minuscule depending on
application area.
In terms of my query, I guess the main point is:
If SIMD processing of elements is worthwhile in a given use-case, do you
think it's worth exposing colony internals so that the programmer can
create a gather-mask by parallel-processing the internal skipfields,
or do you think it's going to be just as well-performing for the
programmer to construct the mask in serial via iterating over the colony
as per usual?
On 2/08/2019 12:20 AM, Niall Douglas via SG14 wrote:
> On 01/08/2019 13:12, Tjernstrom, Staffan via SG14 wrote:
>> Hey Matt,
>>
>> I'm dubious, but not staunchly against.
>>
>> My anecdotal (no current measurements) reason being that I've seen naive code outperform SIMD code for short memory arenas (say < 10 cache lines), presumably due to the down-clocking effect of the later SIMD instruction sets.
>
> Also, Haswell and later have become remarkably good at rewriting short
> runs of scalar code to perform exactly as if they were written using
> SIMD. About 200-300 instructions, I've generally found. If you're
> looking at a loop of less than 200 instructions, and you can guarantee a
> newer CPU, chances are high that a SIMD rewrite will confer negligible
> gains.
>
> Where scatter-gather SIMD really shines is on ARM NEON, but NEON has
> actually useful scatter-gather. One can often avoid whole memory copies
> using their support. I wish Intel would do the same.
>
> (Some tell me AVX-512 does now have this, but I haven't investigated)
>
> Niall
> _______________________________________________
> SG14 mailing list
> SG14_at_[hidden]
> http://lists.isocpp.org/mailman/listinfo.cgi/sg14
>
my understanding is that gather-scatter has greatly improved in AVX512
skylake and onwards, see the following links for details:
https://i.stack.imgur.com/s4OOt.png
https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented
but I take your point that Actual benefits may be minuscule depending on
application area.
In terms of my query, I guess the main point is:
If SIMD processing of elements is worthwhile in a given use-case, do you
think it's worth exposing colony internals so that the programmer can
create a gather-mask by parallel-processing the internal skipfields,
or do you think it's going to be just as well-performing for the
programmer to construct the mask in serial via iterating over the colony
as per usual?
On 2/08/2019 12:20 AM, Niall Douglas via SG14 wrote:
> On 01/08/2019 13:12, Tjernstrom, Staffan via SG14 wrote:
>> Hey Matt,
>>
>> I'm dubious, but not staunchly against.
>>
>> My anecdotal (no current measurements) reason being that I've seen naive code outperform SIMD code for short memory arenas (say < 10 cache lines), presumably due to the down-clocking effect of the later SIMD instruction sets.
>
> Also, Haswell and later have become remarkably good at rewriting short
> runs of scalar code to perform exactly as if they were written using
> SIMD. About 200-300 instructions, I've generally found. If you're
> looking at a loop of less than 200 instructions, and you can guarantee a
> newer CPU, chances are high that a SIMD rewrite will confer negligible
> gains.
>
> Where scatter-gather SIMD really shines is on ARM NEON, but NEON has
> actually useful scatter-gather. One can often avoid whole memory copies
> using their support. I wish Intel would do the same.
>
> (Some tell me AVX-512 does now have this, but I haven't investigated)
>
> Niall
> _______________________________________________
> SG14 mailing list
> SG14_at_[hidden]
> http://lists.isocpp.org/mailman/listinfo.cgi/sg14
>
Received on 2019-08-01 16:52:05