Here's a stack of benchmarks that show how the implementation stacks up. Each test is run untimed for a few loops, then timed for many loops, and a throughput value is calculated (higher is better). The source is available in `main.cpp` inside the download.

operation | gcc 3.1 libstdc++ | macstl 0.1, Altivec off | macstl 0.1, Altivec on |
---|---|---|---|
inline arithmetic | 446 | 888 | 3460 |
inline transcendental | 74 | 78 | 1052 |
outline transcendental | 80 | 99 | 39 |
inline scalarization | 1485 | 1488 | 4291 |
unchunked apply | 408 | 408 | 404 |
unchunked slice | 2932 | 2865 | 2890 |
unchunked mask | 221 | 161 | 168 |
unchunked indirect | 358 | 425 | 540 |
I used a size of 1000 elements for all valarrays. This intentionally keeps the data cache-bound, which maximizes the effect of Altivec code on slow-bus architectures like the test machine, a dual 450 MHz Power Macintosh G4 with 2 x 1 MB L2 cache and 1 GB memory, running Mac OS X 10.2.6. Mileage will differ on a bandwidth-tuned Power Macintosh G5.
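For the curious, the timing scheme looks roughly like this. It's a hypothetical sketch, not the actual `main.cpp` from the download; the warm-up count, loop count, clock and throughput units are all assumptions.

```cpp
#include <valarray>
#include <ctime>
#include <cstdio>

// Rough sketch of the timing scheme: a few untimed warm-up loops,
// then many timed loops, then a throughput figure from the elapsed
// time. Loop counts, clock and units are assumptions, not the
// benchmark's actual values.
int main ()
    {
        const std::size_t n = 1000;     // element count used for all valarrays
        std::valarray <float> vf2 (2.0f, n), vf3 (3.0f, n), vf4 (4.0f, n);
        std::valarray <float> vf1 (n);

        for (int i = 0; i != 10; ++i)   // untimed warm-up loops
            vf1 = vf2 + vf3 + vf4;

        const int loops = 100000;
        std::clock_t start = std::clock ();
        for (int i = 0; i != loops; ++i)    // timed loops
            vf1 = vf2 + vf3 + vf4;
        double secs = double (std::clock () - start) / CLOCKS_PER_SEC;

        // throughput, e.g. element operations per microsecond
        std::printf ("throughput: %f\n", double (loops) * n / (secs * 1e6));
        return 0;
    }
```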
The code is compiled with the gcc 3.1 libstdc++ valarray classes, and also with the macstl 0.1 valarray implementation with the Altivec optimizations turned off (by commenting out the appropriate `chunk_traits` specialization) and turned on. The compiler switches used were:

`-O3 -faltivec -fstrict-aliasing -save-temps`
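To give a flavour of what such a switch might look like, here is a hypothetical sketch of a chunking traits template. It is not macstl's actual `chunk_traits` definition, just an illustration of the idea: a traits template maps an element type to the "chunk" the inner loops step by, and removing the specialization drops the loops back to scalar code.

```cpp
#include <cstddef>

// Hypothetical sketch only -- not macstl's actual chunk_traits.
// The primary template chunks element by element; the specialization
// (the kind of thing you would comment out to turn Altivec off)
// chunks four floats at a time as a vector float.
template <typename T> struct chunk_traits
    {
        typedef T chunk_type;               // scalar fallback: 1 element
        static const std::size_t width = 1;
    };

#ifdef __VEC__
template <> struct chunk_traits <float>
    {
        typedef vector float chunk_type;    // Altivec register: 4 floats
        static const std::size_t width = 4;
    };
#endif
```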
As you can see, the inline arithmetic test is 7.76x faster, the inline transcendental test is 14.21x faster and the inline scalarization test is 2.89x faster than gcc scalar code. The combination of vector code and inlining is unbeatable.
The outline transcendental test is actually slower than gcc scalar code, showing how much is lost by calling into separately compiled modules. And the unchunked rates are comparable to or worse than gcc's, indicating areas for more performance tuning.
Even in the non-optimized case, macstl code is almost twice as fast as gcc's in the inline arithmetic test. A look at the compiled PowerPC opcodes for the inner loop of the following expression reveals why; keep an eye on the all-important loads and stores, which could access slow memory.

`std::valarray <float> vf1 (vf2 + vf3 + vf4);`
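The speed of this expression hinges on fusing the whole sum vf2 + vf3 + vf4 into a single pass with no temporary arrays, which is what an expression-template valarray does. Here is a minimal, self-contained sketch of the technique; the types and names are hypothetical, not macstl's actual machinery.

```cpp
#include <cstddef>
#include <type_traits>

// Minimal expression-template sketch (hypothetical and simplified).
// vf2 + vf3 + vf4 builds a tiny expression tree at compile time, and
// the assignment walks it in one fused loop: 3 loads and 1 store per
// element, with no temporary arrays in between.
struct expr_tag { };

template <typename L, typename R> struct plus_expr : expr_tag
    {
        const L& left;
        const R& right;
        plus_expr (const L& l, const R& r): left (l), right (r) { }
        float operator[] (std::size_t i) const { return left [i] + right [i]; }
    };

struct array : expr_tag
    {
        float* data;
        explicit array (float* d): data (d) { }
        float operator[] (std::size_t i) const { return data [i]; }

        template <typename E> void assign (const E& e, std::size_t n)
            {
                for (std::size_t i = 0; i != n; ++i)    // the single fused loop
                    data [i] = e [i];
            }
    };

// operator+ only participates for expression nodes, so built-in
// float arithmetic inside operator[] is left alone.
template <typename L, typename R>
typename std::enable_if <
    std::is_base_of <expr_tag, L>::value && std::is_base_of <expr_tag, R>::value,
    plus_expr <L, R> >::type
operator+ (const L& l, const R& r)
    {
        return plus_expr <L, R> (l, r);
    }

int main ()
    {
        float f2 [4] = { 1, 2, 3, 4 }, f3 [4] = { 5, 6, 7, 8 },
              f4 [4] = { 9, 10, 11, 12 }, f1 [4];
        array vf2 (f2), vf3 (f3), vf4 (f4), vf1 (f1);
        vf1.assign (vf2 + vf3 + vf4, 4);    // f1[i] = f2[i] + f3[i] + f4[i]
    }
```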
[The three inner loop assembly listings, at labels L172 (gcc 3.1 libstdc++), L452 (macstl 0.1, Altivec off) and L504 (macstl 0.1, Altivec on), are not reproduced here.]
Since the expression adds 3 valarrays and stores into 1 valarray, the theoretical minimum per element is 3 loads and 1 store.
In the gcc case, there are 7 extraneous `lwz` to load various pointers, 3 `lfsx` to load the actual floats and 1 `stfs` to store the result. The `lwz` are strictly unnecessary, as these pointers are not modified within the loop.

In the macstl without Altivec case, the code has eliminated the 7 `lwz`, removed 1 loop index increment `addi`, and replaced the `stfs` with a `stfsx` to reuse the loop index `r2`. In the macstl with Altivec case, the code has replaced the scalar `lfsx` with the vector `lvx`, the scalar `fadds` with `vaddfp`, and the `stfsx` with the vector `stvx`, within the same number of opcodes.
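To make the correspondence concrete, here is what that vector inner loop looks like when written by hand with Altivec intrinsics. This is illustrative code, not macstl's generated output; `vec_ld`, `vec_add` and `vec_st` are the standard Altivec intrinsics that compile to `lvx`, `vaddfp` and `stvx` respectively.

```cpp
#include <cstddef>

// Illustrative hand-written Altivec loop for vf1 = vf2 + vf3 + vf4,
// built with -faltivec as in the article (on FSF gcc, use -maltivec
// and #include <altivec.h>). Assumes n is a multiple of 4 and all
// arrays are 16-byte aligned, since lvx/stvx ignore the low 4 bits
// of the address.
void add3 (float* f1, const float* f2, const float* f3, const float* f4,
    std::size_t n)
    {
        for (std::size_t i = 0; i != n; i += 4)
            {
                vector float v2 = vec_ld (0, f2 + i);               // lvx
                vector float v3 = vec_ld (0, f3 + i);               // lvx
                vector float v4 = vec_ld (0, f4 + i);               // lvx
                vector float sum = vec_add (vec_add (v2, v3), v4);  // vaddfp x 2
                vec_st (sum, 0, f1 + i);                            // stvx
            }
    }
```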
Thus, as you can see, my library succeeded in getting rid of all the extraneous loads and stores, reducing the opcode count from 17 to just 9; hand tuning would save an additional opcode at most. The optimized version exactly replaced all the vectorizable opcodes as well.
A set of results for the new Power Mac G5 would really tell how good the new architecture is. Gentle reader, you could run this on your G5, or better still, donate me a G5; I won't complain.