Here's something that bothers me about library documentation. The API allows two or more ways of doing the same thing; obviously the developers had a good reason for injecting that semantic noise into my thought processes. But then they only document each function, perhaps with some sample code; they don't explain the good and the bad of each approach. I'm lost: I want to know why, not just how.
With that in mind, let's look at the choices you have when calculating the inner product of two vectors. There are four ways you can do this in macstl:
Using valarray is the simplest method, and you'll get surprisingly fast results on all platforms. Forget about writing any loops or even understanding SIMD: just apply the expression you want directly to the valarray and you're done.
#include <macstl/valarray.h>
using namespace stdext;
valarray <float> v1 (1000);
valarray <float> v2 (1000);
float s = (v1 * v2).sum (); // automatically uses Altivec fused multiply-add
Valarrays also transparently handle alignment and tail scalar elements, and advanced slicing is available.
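For instance, here is a rough sketch of slicing, on the assumption that macstl's valarray mirrors the standard std::valarray slice interface (with slice living in namespace stdext); the even-index selection is just an illustration:
#include <macstl/valarray.h>
using namespace stdext;
valarray <float> v1 (1000);
valarray <float> v2 (1000);
valarray <float> e1 = v1 [slice (0, 500, 2)]; // every second element, starting at index 0
valarray <float> e2 = v2 [slice (0, 500, 2)];
float s = (e1 * e2).sum (); // inner product of the even-indexed elements only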
On the other hand, you can't finesse storage or order of execution, nor drop down to potentially faster, more specific platform intrinsics. And unlike the approaches below, valarrays can't be mixed with the other methods.
Using the vec common interface gives you a little more control and still works on all platforms. You'll have to do your own SIMD chunking and looping, but the syntax is still pretty straightforward.
#include <macstl/vec.h>
using namespace macstl;
vec <float, 4> v1 [250];
vec <float, 4> v2 [250];
vec <float, 4> vs;
for (int i = 0; i != 250; ++i)
    vs += v1 [i] * v2 [i]; // doesn't use Altivec fused multiply-add
float s = vs.sum ();
You have to handle your own alignment, tail elements, storage and order of execution. Because each common function may translate to several intrinsics and there is no opportunity for expression optimization, the common interface may be slightly slower than the other methods. But you can mix and match it with the following methods.
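To make the tail handling concrete, here is a minimal sketch that uses only the common interface from above. It assumes the data arrives as plain, 16-byte-aligned float arrays whose length n is not necessarily a multiple of 4; the dot function and its parameters are illustrative, not part of macstl:
#include <cstddef>
#include <macstl/vec.h>
using namespace macstl;

float dot (const float* a, const float* b, std::size_t n)
{
    const std::size_t chunks = n / 4; // number of whole 4-float chunks

    // reinterpret the aligned arrays as arrays of 4-float vectors
    const vec <float, 4>* va = reinterpret_cast <const vec <float, 4>*> (a);
    const vec <float, 4>* vb = reinterpret_cast <const vec <float, 4>*> (b);

    vec <float, 4> vs; // accumulator, default-constructed as in the loop above
    for (std::size_t i = 0; i != chunks; ++i)
        vs += va [i] * vb [i];
    float s = vs.sum ();

    // the leftover tail elements are handled in plain scalar code
    for (std::size_t i = chunks * 4; i != n; ++i)
        s += a [i] * b [i];
    return s;
}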
Gain the ultimate control with the vec platform interface, but give up platform independence.
#include <macstl/vec.h>
using namespace macstl;
using namespace macstl::altivec;
vec <float, 4> v1 [250];
vec <float, 4> v2 [250];
vec <float, 4> vs;
for (int i = 0; i != 250; ++i)
    vs = madd (v1 [i], v2 [i], vs); // use the Altivec fused multiply-add explicitly
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <32> ())); // shift left one element and add
vs = add (vs, slo (vs, vec <unsigned char, 16>::fill <64> ())); // shift left two elements and add
float s = vs [0]; // the full sum is now in element 0
You have to handle your own alignment, tail elements, storage and order of execution. But because the platform functions map one-to-one to the underlying intrinsics, you'll get the maximum performance.
If you love the STL, you can use vec functors together with containers and algorithms. Unfortunately, the syntax is something else entirely.
#include <macstl/vec.h>
using namespace macstl;
using namespace macstl::altivec;
vec <float, 4> v1 [250];
vec <float, 4> v2 [250];
vec <float, 4> vs = stdext::accumulate2 (v1, v1 + 250, v2, vec <float, 4> (),
    madd_function <vec <float, 4> > ()); // use the Altivec fused multiply-add explicitly
float s = vs.sum ();
Therein lies the power of generic programming: maximum flexibility in storage, since you can use any compatible container, and optimum performance from the chosen algorithm. You will still have to handle your own alignment and tail elements.
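As a rough sketch of that last point, you could mix stdext::accumulate2 over the whole SIMD chunks with std::inner_product over the scalar tail; again the dot function and its parameters are illustrative, and the arrays are assumed to be 16-byte aligned:
#include <cstddef>
#include <numeric>
#include <macstl/vec.h>
using namespace macstl;
using namespace macstl::altivec;

float dot (const float* a, const float* b, std::size_t n)
{
    const std::size_t chunks = n / 4;
    const vec <float, 4>* va = reinterpret_cast <const vec <float, 4>*> (a);
    const vec <float, 4>* vb = reinterpret_cast <const vec <float, 4>*> (b);

    // SIMD part: fused multiply-add over the whole chunks
    float s = stdext::accumulate2 (va, va + chunks, vb, vec <float, 4> (),
        madd_function <vec <float, 4> > ()).sum ();

    // scalar part: the leftover tail elements
    return std::inner_product (a + chunks * 4, a + n, b + chunks * 4, s);
}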