A novel way to SIMD library

From: kate <schwaa_at_[hidden]>
Date: Sat, 16 May 2020 19:01:21 +0800
Hi there.

I'd like to share an idea for building a SIMD library. I posted it on
Reddit, and some readers said it could be a viable replacement for the
existing SIMD proposal. I am not an expert on that, so I'd like to know
whether you think it really deserves a proposal.

There are already plenty of SIMD libraries, which usually expose
user-friendly interfaces by wrapping the messy underlying SIMD
intrinsics. But I found a novel approach to implementing similar
interfaces. The idea is very simple: unroll the loops introduced by
vector operations at compile time and leave the rest of the work, i.e.
generating code that uses SIMD instructions, to the compiler. It relies
on the compiler's optimizer to generate vectorized instructions from the
unrolled code. For modern compilers, that should be easy.
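
To make the idea concrete, here is a minimal sketch of compile-time
unrolling, assuming C++17. The names add and add_unrolled are
hypothetical and are not taken from the implementation linked below; the
sketch only shows how a pack expansion turns an element-wise loop into
N independent statements that the optimizer is then free to vectorize.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical helper: the fold over the comma operator expands to
// out[0] = a[0] + b[0], out[1] = a[1] + b[1], ... with no runtime loop.
template <typename T, std::size_t... I>
inline void add_unrolled(const T* a, const T* b, T* out,
                         std::index_sequence<I...>) {
    ((out[I] = a[I] + b[I]), ...);
}

// Element-wise sum of two fixed-size arrays. The index sequence
// 0, 1, ..., N-1 is generated at compile time from N.
template <typename T, std::size_t N>
inline std::array<T, N> add(const std::array<T, N>& a,
                            const std::array<T, N>& b) {
    std::array<T, N> out{};
    add_unrolled(a.data(), b.data(), out.data(),
                 std::make_index_sequence<N>{});
    return out;
}
```

Whether the resulting straight-line code is actually turned into SIMD
instructions depends on the compiler, the optimization level and the
target flags, so it is worth checking the generated assembly.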

A simple benchmark showed that code vectorized this way performed
similarly to code written with intrinsics.

Its simplicity brings some significant advantages over traditional
libraries:

1) Easy to implement and standardize. All it needs is template tricks.
Unrolling can be done with parameter pack expansion, while the
user-friendly interface relies on function templates and operator
overloading. Neither intrinsics nor extra libraries are required.

2) Portability. Since no intrinsics are required, it can be used on
various platforms.

3) Evolving efficiency. As compilers keep improving, the work that
falls on them will be done better and better.

4) Never out of date. Every few years, CPU vendors add a batch of new
SIMD instructions. A library built this way needs no modification,
because compilers will adapt to these changes and generate more
efficient code.
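
As an illustration of point 1, the interface side could look roughly
like the sketch below, again assuming C++17. The type vec and the helper
detail::apply are hypothetical names chosen for this example, not the
API of any existing library or proposal. Operator overloading gives the
user-facing syntax, and the pack expansion inside apply does the
unrolling.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical fixed-size vector type; name and layout are illustrative.
template <typename T, std::size_t N>
struct vec {
    T data[N];

    T& operator[](std::size_t i) { return data[i]; }
    const T& operator[](std::size_t i) const { return data[i]; }
};

namespace detail {
// Applies op lane by lane. The pack expansion produces one call per
// lane, so no loop remains in the source.
template <typename T, std::size_t N, typename Op, std::size_t... I>
vec<T, N> apply(const vec<T, N>& a, const vec<T, N>& b, Op op,
                std::index_sequence<I...>) {
    return vec<T, N>{{op(a[I], b[I])...}};
}
}  // namespace detail

template <typename T, std::size_t N>
vec<T, N> operator+(const vec<T, N>& a, const vec<T, N>& b) {
    return detail::apply(a, b, [](T x, T y) -> T { return x + y; },
                         std::make_index_sequence<N>{});
}

template <typename T, std::size_t N>
vec<T, N> operator*(const vec<T, N>& a, const vec<T, N>& b) {
    return detail::apply(a, b, [](T x, T y) -> T { return x * y; },
                         std::make_index_sequence<N>{});
}
```

With these definitions an expression such as a * b + a, where a and b
are vec<float, 8>, compiles to straight-line scalar code in the source
that contains no intrinsics and no platform-specific headers.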

But it does have shortcomings. Experiments show that:

1) Performance varies across compilers. Different compilers generated
code with different performance, and Clang did better than GCC.

2) It doesn't help much with simple code, as compilers already optimize
it very well.

If this method becomes more mainstream, it will incentivise compiler
and CPU vendors to ensure the best SIMD instructions for each
architecture are used for various transforms on homogeneous or
contiguous data, which will only strengthen this technique's efficacy.

That's all about it. I'm looking forward to your comments.

Reddit Post:

A simple and incomplete implementation:

Yours Sincerely,
Zhang Zhang

Received on 2020-05-16 06:04:29