Altivec & Valarray: Flowing with the Go (II)


C++, the one-time star of the C language family, now comfortably holding elder brother status to Java and C#, may hold the answer.

C++ has operator overloading, where you can actually code v1 + v2 for your own types. This feature lets us humans apply our intuition about regular, everyday arithmetic to a different problem domain — yielding code that is heaps easier to write as well as read.
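Here is a minimal sketch of what operator overloading buys — my own illustration, not code from any particular library. Once operator+ is defined for a small Vec3 type, callers can write v1 + v2 and have it read like ordinary arithmetic.

    struct Vec3 { float x, y, z; };

    // With this one overload, client code writes a + b instead of add (a, b).
    inline Vec3 operator+ (const Vec3& a, const Vec3& b)
    {
        Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
        return r;
    }

    // usage: Vec3 sum = v1 + v2;    -- reads like everyday arithmetic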

C++ also has function inlining, where the compiler folds in the actual code from called functions instead of just leaving in the function call itself. After inlining, a sharp compiler would then also hoist loop invariants out of loops, detect unaliased pointers, weed out redundant loads and stores and keep data in fast registers rather than slow memory.
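As another sketch of my own (not the library's code): once the compiler folds scale () into the loop below, there is no call per element, the factor argument can live in a register, and the loop body reduces to a load, a multiply and a store.

    #include <cstddef>

    inline float scale (float x, float factor)
    {
        return x * factor;
    }

    void scale_all (float* dst, const float* src, std::size_t n, float factor)
    {
        for (std::size_t i = 0; i != n; ++i)
            dst [i] = scale (src [i], factor);    // call folded away after inlining
    }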

Valarray to the rescue

Altivec’s final salvation could be a module no longer much in favor in the harem of the Standard C++ Library. std::valarray has been eclipsed of late by her sexier sisters std::vector and the STL container classes. However, if you look into her origins, you’ll find she was designed for this very thing: numerical computing on vector supercomputers.

Here’s why and how Altivec and std::valarray make good partners:

Clear, understandable code. You can write code like v1 = v2 * sin (v3) and see it compile down to tight loops with straight-line, minimal load/store PowerPC opcodes.

std::valarray is std. Because it’s a standard with a known specification, there is more knowledge and certainty about it out there. A programmer doesn’t have to learn a new and esoteric library interface. And it’s not going to go away any time soon.

Best of both worlds. Vectorizing std::valarray for Altivec gives us the best of both worlds: a standard but slower implementation for non-Altivec machines and an optimized, faster implementation for Altivec machines. It’s the closest thing to a vectorizing compiler, since you don’t have to know or write Altivec instructions explicitly.

No separate modules. Most of the Standard C++ Library lives as source code in header files, not in a separately compiled module. This means std::valarray functions can and will be inlined during compilation, which is precisely what we need for fast code.

Type extensibility. std::valarray is templated on its scalar type, which means you can declare a std::valarray <float> and a std::valarray <long> and have pretty much the same set of operations available on both, as in the sketch after this list. Currently only valarray<char>, valarray<short>, valarray<long> and valarray<float> are optimized for Altivec. Conceivably, support for long long and std::complex types could be added later without changing the interface.

Control of alignment and scalar size. I wrote the memory allocation routines so that all valarrays are already properly aligned front and back for Altivec operations. And since valarray only allows binary operations between the same types — even if the underlying types are otherwise convertible in C — I didn’t have to worry about packing or unpacking data between, say, a long element and a short element.
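The sketch below is hypothetical code of my own, using only standard std::valarray operations. It shows the same element-wise expression written for float and for long; per the list above, both are among the scalar types optimized for Altivec in this library.

    #include <valarray>

    std::valarray <float> af (1000), bf (1000), cf (1000);
    std::valarray <long>  al (1000), bl (1000), cl (1000);

    void compute ()
    {
        cf = af * bf + 1.0f;    // element-wise multiply-add on floats
        cl = al * bl + 1L;      // the same code shape for longs
    }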

Put it all together and easy-to-understand code like v1 = v2 * sin (v3) compiles down to tight loops with straight-line, minimal load/store PowerPC opcodes, no function calls and no other branching. I have seen inline arithmetic go 7.76x faster and inline transcendentals go 14.21x faster than unoptimized std::valarray.
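For a concrete flavor, here is a small, self-contained example of my own, built entirely on standard std::valarray calls; the same source should compile against either the plain standard implementation or an Altivec-optimized one.

    #include <cstddef>
    #include <cstdio>
    #include <valarray>

    int main ()
    {
        std::valarray <float> v1 (1024), v2 (1024), v3 (1024);

        for (std::size_t i = 0; i != v2.size (); ++i)
        {
            v2 [i] = float (i);      // some sample data
            v3 [i] = 0.001f * i;
        }

        // reads like the math: element-wise multiply of v2 by sin of v3
        v1 = v2 * std::sin (v3);

        std::printf ("v1 [100] = %f\n", v1 [100]);
        return 0;
    }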

The Power Mac G5 finally gives the Altivec unit the hardware it deserves. It is my hope that Altivec software will take off too, with the Altivec-optimized valarray library.

Mon, 29 Sep 2003. © Pixelglow Software.