C++, the one-time star of the C language family, now comfortably holding elder brother status to Java and C#, may hold the answer.
C++ has operator overloading, where you can actually code v1 + v2 for your own types. This feature lets us humans apply our intuition about regular, everyday arithmetic to a different problem domain, yielding code that is heaps easier to write as well as read.
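As a rough illustration (the Vec4 type and its operator below are mine, not part of any library), overloading operator+ lets callers write plain arithmetic on a user-defined type:

    #include <cstddef>

    // A toy fixed-size vector type, purely illustrative.
    struct Vec4
    {
        float e[4];
    };

    // Overloading operator+ lets callers write a + b instead of add (a, b).
    inline Vec4 operator+ (const Vec4& a, const Vec4& b)
    {
        Vec4 r;
        for (std::size_t i = 0; i < 4; ++i)
            r.e[i] = a.e[i] + b.e[i];
        return r;
    }

    // Usage: everyday arithmetic notation on your own type.
    // Vec4 v1 = v2 + v3;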
C++ also has function inlining, where the compiler folds the actual code of called functions into the call site instead of leaving in the function call itself. After inlining, a sharp compiler would then also hoist loop invariants out of loops, detect unaliased pointers, weed out redundant loads and stores, and keep data in fast registers rather than slow memory.
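A rough sketch of what that buys you (the helper below is hypothetical, just to illustrate):

    #include <cstddef>

    // A small helper; once inlined, its body is substituted at the call site
    // and no function-call overhead remains.
    inline float scale_and_offset (float x, float gain, float bias)
    {
        return x * gain + bias;
    }

    void process (float* out, const float* in, std::size_t n, float gain)
    {
        // After inlining, a good optimizer can keep gain and the 0.5f constant
        // in registers and hoist them out of the loop, leaving only the
        // per-element multiply-add inside it.
        for (std::size_t i = 0; i != n; ++i)
            out[i] = scale_and_offset (in[i], gain, 0.5f);
    }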
Altivec's final salvation could be a module no longer much in favor in the harem of the Standard C++ Library. std::valarray has been eclipsed of late by her sexier sisters std::vector and the STL container classes. However, if you look into her origins, you'll find she was designed for this very thing: numerical computing using vector supercomputers.

Here's why and how Altivec and std::valarray make good partners:
You can write clear, understandable code like v1 = v2 * sin (v3), and see it compile down to tight loops of straight-line, minimal load/store PowerPC opcodes.

std::valarray is std. Because it's a standard with a known specification, there is more knowledge and certainty about it out there. A programmer doesn't have to learn a new and esoteric library interface. And it's not going to go away any time soon.

Best of both worlds. Vectorizing std::valarray for Altivec gives us the best of both worlds: a standard but slower implementation for non-Altivec machines and an optimized, faster implementation for Altivec machines. It's the closest thing to a vectorizing compiler, since you don't have to know or write Altivec instructions explicitly.

No separate modules. Most of the Standard C++ Library lives as source code in header files, not in a separately compiled module. This means std::valarray functions can and will be inlined during compiling, precisely what we need for fast code.

Type extensibility. valarrays are templated on their scalar types, which means you can declare a std::valarray<float> and a std::valarray<long> and have pretty much the same set of operations available on both (see the sketch after this list). Currently only valarray<char>, valarray<short>, valarray<long>, and valarray<float> are optimized for Altivec. Conceivably, support can be added later for long long and std::complex types without changing the interface.

Control of alignment and scalar size. I wrote the memory allocation routines so that all valarrays are already properly aligned front and back for Altivec operations. And since valarray only allows binary operations between the same types, even if the underlying types are otherwise convertible in C, I didn't have to worry about packing or unpacking data between, say, a long element and a short element.
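To make the type-extensibility point concrete, here is a minimal, hypothetical snippet (the sizes and fill values are mine) using nothing but the standard interface:

    #include <valarray>

    int main ()
    {
        // The same element-wise operations are available for either scalar type.
        std::valarray<float> vf1 (1.5f, 1024), vf2 (2.0f, 1024);
        std::valarray<long>  vl1 (3L, 1024),   vl2 (4L, 1024);

        std::valarray<float> vf3 = vf1 * vf2 + vf1;
        std::valarray<long>  vl3 = vl1 * vl2 + vl1;

        // Mixing scalar types in one expression, e.g. vf1 * vl1, simply does
        // not compile, so no packing or unpacking between element sizes is
        // ever needed.
        return 0;
    }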
Using std::valarray, you can write easy-to-understand code like v1 = v2 * sin (v3), and see it compile down to tight loops of straight-line, minimal load/store PowerPC opcodes: no function calls, no other branching. I have seen inline arithmetic go 7.76x faster and inline transcendentals go 14.21x faster than unoptimized std::valarray.
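In full, that one-liner looks something like this (the array size and fill values are placeholders of mine):

    #include <cstddef>
    #include <valarray>

    int main ()
    {
        const std::size_t n = 4096;
        std::valarray<float> v1 (n), v2 (0.5f, n), v3 (0.25f, n);

        // Element-wise: v1[i] = v2[i] * sin (v3[i]) for every i, with no
        // explicit loop in the source; sin here is the std::valarray overload.
        v1 = v2 * sin (v3);
        return 0;
    }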
The Power Mac G5 finally gives the Altivec unit the hardware it deserves. It is my hope that Altivec software will take off too, with the Altivec-optimized valarray library.