Choosing an API for SIMD

One thing bothers me about library documentation. An API allows two or more ways of doing the same thing — obviously, the developers had a good reason for injecting semantic noise into my thought processes. But then they only document each function, and perhaps leave some sample code; they don't explain the good and the bad of each approach. I am lost — I want to know why, not just how.

With that in mind, let’s look at the choices you have when calculating the inner product of two vectors. There are four ways you can do this in macstl:

Using valarray

Using valarray is the simplest method and you’ll get surprisingly fast results on all platforms. Forget about writing any loops or even understanding SIMD: just apply the expression you want directly to the valarray and you’re done.

#include <macstl/valarray.h>
using namespace stdext;

valarray <float> v1 (1000);
valarray <float> v2 (1000);
float s = (v1 * v2).sum (); // automatically uses Altivec fused multiply-add

Valarrays also transparently handle alignment and tail scalar elements, and advanced slicing is available.
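
For example, slicing works much as it does with the standard valarray. Here is a minimal sketch, assuming stdext mirrors the standard slice class and the usual slice-to-valarray conversion:

#include <macstl/valarray.h>
using namespace stdext;

valarray <float> v1 (1000);
valarray <float> odds = v1 [slice (1, 500, 2)]; // every second element of v1
float odd_sum = odds.sum ();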

On the other hand, you can't finesse storage or order of execution, nor use the potentially faster or more specialized platform-specific intrinsics. And unlike the approaches below, valarrays can't be mixed and matched with the other methods.

Using the vec common interface

Using the vec common interface gives a little more control and still works on all platforms. You’ll have to do your own SIMD chunking and looping, but the syntax is still pretty straightforward.

#include <macstl/vec.h>
using namespace macstl;

vec <float, 4> v1 [250];
vec <float, 4> v2 [250];

vec <float, 4> vs = vec <float, 4> (); // accumulator, starts at zero
for (int i = 0; i != 250; ++i)
  vs += v1 [i] * v2 [i]; // doesn't use Altivec fused multiply-add
float s = vs.sum ();

You have to handle your own alignment, tail elements, storage and order of execution. Because the common functions may translate to several intrinsics, and there's no opportunity for expression optimization, the common interface may be slightly slower than the other methods. But you can mix and match with the following methods.
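
For instance, here is a minimal sketch of that mixing, with the platform madd from the next section doing the multiply-accumulate and the common sum () doing the final reduction:

#include <macstl/vec.h>
using namespace macstl;

vec <float, 4> v1 [250];
vec <float, 4> v2 [250];

vec <float, 4> vs = vec <float, 4> (); // accumulator, starts at zero
for (int i = 0; i != 250; ++i)
  vs = altivec::madd (v1 [i], v2 [i], vs); // platform fused multiply-add
float s = vs.sum (); // common horizontal sum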

Using the vec platform interface

Gain the ultimate control with the vec platform interface, but give up platform independence.

#include <macstl/vec.h>

using namespace macstl;
using namespace macstl::altivec;

vec <float, 4> v1 [250];
vec <float, 4> v2 [250];

vec <float, 4> vs = vec <float, 4> (); // accumulator, starts at zero
for (int i = 0; i != 250; ++i)
  vs = madd (v1 [i], v2 [i], vs); // use the Altivec fused multiply-add explicitly
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <32> ())); // shift left one float and add
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <64> ())); // shift left two floats and add; element 0 now holds the total
float s = vs [0];

You have to handle your own alignment, tail elements, storage and order of execution. But because the platform functions map one-to-one onto the underlying intrinsics, you'll get the maximum performance.
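
To make the tail handling concrete, here is a minimal sketch for 1002 floats rather than a round 1000: 250 full chunks go through the platform intrinsics, and a plain scalar loop picks up the last two elements. It assumes a GCC-style aligned attribute and that reinterpreting 16-byte aligned float storage as vec <float, 4> is acceptable to your compiler:

#include <macstl/vec.h>

using namespace macstl;
using namespace macstl::altivec;

float a1 [1002] __attribute__ ((aligned (16))); // GCC-style alignment attribute
float a2 [1002] __attribute__ ((aligned (16)));

const vec <float, 4>* v1 = reinterpret_cast <const vec <float, 4>*> (a1);
const vec <float, 4>* v2 = reinterpret_cast <const vec <float, 4>*> (a2);

vec <float, 4> vs = vec <float, 4> (); // accumulator, starts at zero
for (int i = 0; i != 250; ++i)
  vs = madd (v1 [i], v2 [i], vs); // 250 full chunks of 4 floats
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <32> ()));
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <64> ()));
float s = vs [0];

for (int i = 1000; i != 1002; ++i)
  s += a1 [i] * a2 [i]; // scalar loop for the two tail elements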

Using the vec functors

If you love STL, you can use vec functors together with containers and algorithms. Unfortunately the syntax is something else entirely.

#include <macstl/vec.h>

using namespace macstl;
using namespace macstl::altivec;

vec <float, 4> v1 [250];
vec <float, 4> v2 [250];

vec <float, 4> vs = stdext::accumulate2 (v1, v1 + 250, v2,
  vec <float, 4> (), madd_function <vec <float, 4> > ()); // use the Altivec fused multiply-add explicitly
float s = vs.sum ();

Therein lies the power of generic programming — maximum flexibility with storage since you can use any compatible container, and optimum performance with the chosen algorithm. You will still have to handle your own alignment and tail elements.
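
To see that flexibility of storage, here is a minimal sketch that swaps the raw arrays for std::vector. It assumes accumulate2 takes ordinary iterators like any other STL-style algorithm, and that the vector's storage satisfies the 16-byte alignment a vec <float, 4> needs:

#include <macstl/vec.h>
#include <vector>

using namespace macstl;
using namespace macstl::altivec;

std::vector <vec <float, 4> > v1 (250);
std::vector <vec <float, 4> > v2 (250);

vec <float, 4> vs = stdext::accumulate2 (v1.begin (), v1.end (), v2.begin (),
  vec <float, 4> (), madd_function <vec <float, 4> > ());
float s = vs.sum ();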

Mon, 29 Sep 2003. © Pixelglow Software.