Here's a stack of benchmarks that show how the implementation stacks up. Each test is run untimed for a few loops, then timed for many loops, and a throughput value is calculated (higher is better). The source is available in `main.cpp` inside the download.

operation | gcc 3.1 libstdc++ | macstl 0.1, Altivec off | macstl 0.1, Altivec on |
---|---|---|---|
inline arithmetic | 446 | 888 | 3460 |
inline transcendental | 74 | 78 | 1052 |
outline transcendental | 80 | 99 | 39 |
inline scalarization | 1485 | 1488 | 4291 |
unchunked apply | 408 | 408 | 404 |
unchunked slice | 2932 | 2865 | 2890 |
unchunked mask | 221 | 161 | 168 |
unchunked indirect | 358 | 425 | 540 |
I used a size of 1000 elements for all valarrays. This intentionally keeps the data cache-bound, which maximizes the effect of Altivec code on slow-bus architectures like the test machine, a dual 450 MHz Power Macintosh G4 with 2 x 1 MB L2 cache and 1 GB memory, running Mac OS X 10.2.6. Mileage will differ on a bandwidth-tuned Power Macintosh G5.
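For the curious, the timing scheme looks roughly like this. It's a hypothetical sketch, not the actual `main.cpp` from the download; the warm-up count, loop count, clock and throughput units are all assumptions.

```cpp
#include <valarray>
#include <ctime>
#include <cstdio>

// Rough sketch of the timing scheme: a few untimed warm-up loops,
// then many timed loops, then a throughput figure from the elapsed
// time. Loop counts, clock and units are assumptions, not the
// benchmark's actual values.
int main ()
    {
        const std::size_t n = 1000;     // element count used for all valarrays
        std::valarray <float> vf2 (2.0f, n), vf3 (3.0f, n), vf4 (4.0f, n);
        std::valarray <float> vf1 (n);

        for (int i = 0; i != 10; ++i)   // untimed warm-up loops
            vf1 = vf2 + vf3 + vf4;

        const int loops = 100000;
        std::clock_t start = std::clock ();
        for (int i = 0; i != loops; ++i)    // timed loops
            vf1 = vf2 + vf3 + vf4;
        double secs = double (std::clock () - start) / CLOCKS_PER_SEC;

        // throughput, e.g. element operations per microsecond
        std::printf ("throughput: %f\n", double (loops) * n / (secs * 1e6));
        return 0;
    }
```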
The code is compiled with the gcc 3.1 libstdc++ valarray classes, and also with the macstl 0.1 valarray implementation with the Altivec optimizations turned off (by commenting out the appropriate `chunk_traits` specialization) and turned on. The compiler switches used were:

`-O3 -faltivec -fstrict-aliasing -save-temps`
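To give a flavour of what such a switch might look like, here is a hypothetical sketch of a chunking traits template. It is not macstl's actual `chunk_traits` definition, just an illustration of the idea: a traits template maps an element type to the "chunk" the inner loops step by, and removing the specialization drops the loops back to scalar code.

```cpp
#include <cstddef>

// Hypothetical sketch only -- not macstl's actual chunk_traits.
// The primary template chunks element by element; the specialization
// (the kind of thing you would comment out to turn Altivec off)
// chunks four floats at a time as a vector float.
template <typename T> struct chunk_traits
    {
        typedef T chunk_type;               // scalar fallback: 1 element
        static const std::size_t width = 1;
    };

#ifdef __VEC__
template <> struct chunk_traits <float>
    {
        typedef vector float chunk_type;    // Altivec register: 4 floats
        static const std::size_t width = 4;
    };
#endif
```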
As you can see, the inline arithmetic test is 7.76x faster, the inline transcendental test is 14.21x faster and the inline scalarization test is 2.89x faster than gcc scalar code. The combination of vector code and inlining is unbeatable.
The outline transcendental test is actually slower than gcc scalar code, showing how much is lost by calling into separately compiled modules. And the unchunked rates are comparable to or worse than gcc's, indicating areas for more performance tuning.
Even in the non-optimized case, macstl code is almost twice as fast as gcc's in the inline arithmetic test. A look at the compiled PowerPC opcodes for the inner loop of the following expression reveals why; keep an eye on the all-important loads and stores, which could access slow memory.

`std::valarray <float> vf1 (vf2 + vf3 + vf4);`
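The speed of this expression hinges on fusing the whole sum vf2 + vf3 + vf4 into a single pass with no temporary arrays, which is what an expression-template valarray does. Here is a minimal, self-contained sketch of the technique; the types and names are hypothetical, not macstl's actual machinery.

```cpp
#include <cstddef>
#include <type_traits>

// Minimal expression-template sketch (hypothetical and simplified).
// vf2 + vf3 + vf4 builds a tiny expression tree at compile time, and
// the assignment walks it in one fused loop: 3 loads and 1 store per
// element, with no temporary arrays in between.
struct expr_tag { };

template <typename L, typename R> struct plus_expr : expr_tag
    {
        const L& left;
        const R& right;
        plus_expr (const L& l, const R& r): left (l), right (r) { }
        float operator[] (std::size_t i) const { return left [i] + right [i]; }
    };

struct array : expr_tag
    {
        float* data;
        explicit array (float* d): data (d) { }
        float operator[] (std::size_t i) const { return data [i]; }

        template <typename E> void assign (const E& e, std::size_t n)
            {
                for (std::size_t i = 0; i != n; ++i)    // the single fused loop
                    data [i] = e [i];
            }
    };

// operator+ only participates for expression nodes, so built-in
// float arithmetic inside operator[] is left alone.
template <typename L, typename R>
typename std::enable_if <
    std::is_base_of <expr_tag, L>::value && std::is_base_of <expr_tag, R>::value,
    plus_expr <L, R> >::type
operator+ (const L& l, const R& r)
    {
        return plus_expr <L, R> (l, r);
    }

int main ()
    {
        float f2 [4] = { 1, 2, 3, 4 }, f3 [4] = { 5, 6, 7, 8 },
              f4 [4] = { 9, 10, 11, 12 }, f1 [4];
        array vf2 (f2), vf3 (f3), vf4 (f4), vf1 (f1);
        vf1.assign (vf2 + vf3 + vf4, 4);    // f1[i] = f2[i] + f3[i] + f4[i]
    }
```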
[The three inner loop assembly listings, at labels L172 (gcc 3.1 libstdc++), L452 (macstl 0.1, Altivec off) and L504 (macstl 0.1, Altivec on), are not reproduced here.]
Since the expression adds 3 valarrays and stores into 1 valarray, the theoretical minimum per element is 3 loads and 1 store.
In the gcc case, there are 7 extraneous `lwz` to load various pointers, 3 `lfsx` to load the actual floats and 1 `stfs` to store the result. The `lwz` are strictly unnecessary, as these pointers are not modified within the loop.

In the macstl without Altivec case, the code has eliminated the 7 `lwz`, removed 1 loop index increment `addi`, and replaced the `stfs` with a `stfsx` to reuse the loop index `r2`. In the macstl with Altivec case, the code has replaced the scalar `lfsx` with the vector `lvx`, the scalar `fadds` with `vaddfp`, and the `stfsx` with the vector `stvx`, within the same number of opcodes.
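To make the correspondence concrete, here is what that vector inner loop looks like when written by hand with Altivec intrinsics. This is illustrative code, not macstl's generated output; `vec_ld`, `vec_add` and `vec_st` are the standard Altivec intrinsics that compile to `lvx`, `vaddfp` and `stvx` respectively.

```cpp
#include <cstddef>

// Illustrative hand-written Altivec loop for vf1 = vf2 + vf3 + vf4,
// built with -faltivec as in the article (on FSF gcc, use -maltivec
// and #include <altivec.h>). Assumes n is a multiple of 4 and all
// arrays are 16-byte aligned, since lvx/stvx ignore the low 4 bits
// of the address.
void add3 (float* f1, const float* f2, const float* f3, const float* f4,
    std::size_t n)
    {
        for (std::size_t i = 0; i != n; i += 4)
            {
                vector float v2 = vec_ld (0, f2 + i);               // lvx
                vector float v3 = vec_ld (0, f3 + i);               // lvx
                vector float v4 = vec_ld (0, f4 + i);               // lvx
                vector float sum = vec_add (vec_add (v2, v3), v4);  // vaddfp x 2
                vec_st (sum, 0, f1 + i);                            // stvx
            }
    }
```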
Thus, as you can see, my library succeeded in getting rid of all the extraneous loads and stores, reducing the opcode count from 17 to just 9; hand tuning would save an additional opcode at most. The optimized version exactly replaced all the vectorizable opcodes as well.
A set of results for the new Power Mac G5 would really tell how good the new architecture is. Gentle reader, you could run this on your G5, or better still, donate me a G5; I won't complain.