Altivec & Valarray: Flowing with the Go (I)

Data melts from the pristine platters of the hard disk and streams through the deep banks of RAM. It surges through the pipes of the front-side bus and cascades through the waterfalls of the L3, L2 and L1 caches, finally to be sucked into the insatiable maw of the CPU. Nothing must be allowed to break its flow.


Scalar execution units chug through data one piece at a time, but Altivec swallows 4, 8 or even 16 pieces at a time. In theory, then, a 1 GHz Altivec-enabled G4 should be able to gulp down as much data as a 4 GHz scalar chip. They may call it the Velocity Engine, but Altivec is really Apple’s secret weapon in the megahertz wars.
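Where do 4, 8 and 16 come from? An Altivec register is 128 bits wide, so the count is simply how many elements of a given size fit into 16 bytes. Here is a minimal sketch of that arithmetic in C++; the names kVectorBytes and lanes are my own for illustration, not part of any Altivec interface.

```cpp
#include <cstddef>
#include <cstdint>

// An Altivec vector register is 128 bits, i.e. 16 bytes, wide.
constexpr std::size_t kVectorBytes = 16;

// Hypothetical helper: how many elements of type T fit in one register.
template <typename T>
constexpr std::size_t lanes() {
    return kVectorBytes / sizeof(T);
}

// Checked at compile time: 4, 8 or 16 pieces per instruction.
static_assert(lanes<std::int32_t>() == 4, "four 32-bit elements");
static_assert(lanes<std::int16_t>() == 8, "eight 16-bit elements");
static_assert(lanes<std::int8_t>() == 16, "sixteen 8-bit elements");
```

The "4 GHz" figure corresponds to the 32-bit case: four elements per instruction at 1 GHz.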

This need for an unbroken flow applies to the scalar execution units found in most CPUs, and even more so to vector or SIMD (Single Instruction Multiple Data) execution units like Motorola’s Altivec or Intel’s SSE.

In practice though, many things act to break up the flow of data.

Hardware in Hot Water

On the hardware side, the fast chip is betrayed by a slow bus. Even if the CPU consumes data 4 times as fast as the competition, a bus that runs at the same speed will eventually let the fast but shallow caches run dry, and the CPU slows to the speed of the RAM trickling through the bus.

Unfortunately, Apple’s Power Mac G4 has been dammed by this very issue for many years: the last crop of G4s starved a 1.4 GHz CPU with a bus 8 times slower. It’s all about the bandwidth.

That’s why Apple’s new Power Mac G5 is “the world’s fastest personal computer”: its screaming fast dual 2 GHz PowerPC 970s are only half the story. With an architecture tuned for throughput, it has 12x the bus bandwidth, 2x the memory bandwidth, 7x the peripheral bandwidth and 3x the storage bandwidth of the fastest G4.

We finally have a machine that can feed the naked need of the Altivec unit.

Steaming about Software

On the software side, we need to pump the next Photoshop or Premiere full of Altivec instructions. But SIMD only works if you code it that way: single instructions that do the same thing to multiple data. That’s why compilers that automatically vectorize code have such a hard time of it. Not only do they have to spot patterns of parallelism that may or may not be there in your source code, they can only vectorize when your data is aligned correctly. After all, compilers are only… not human.
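To make that concrete, here is a rough sketch in portable C++ of the two kinds of loop a vectorizing compiler might meet. The function names and the alignment check are my own illustrations. Whether a given compiler actually vectorizes the first loop depends on the compiler and its flags, so treat this as a picture of the pattern rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary,
// the natural alignment for a 128-bit vector load or store.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// SIMD-friendly: the same operation on every element, and no
// iteration depends on the result of another.
void scale_and_add(float* y, const float* x, float k, std::size_t n) {
    assert(is_aligned16(x) && is_aligned16(y));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += k * x[i];
    }
}

// SIMD-hostile as written: each iteration needs the previous result,
// so the elements cannot simply be processed four at a time.
void running_sum(float* y, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        y[i] += y[i - 1];
    }
}
```

Declaring the arrays with alignas(16) is one way to satisfy the alignment check on the caller’s side.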


How, then, do you code for Altivec?

You could use the Altivec C programming interface, a set of intrinsic types and C functions designed by Motorola. I believe they’ve done a great job exposing the metal of Altivec to your average non-assembly programmer, with the strong data typing and overloading of an object-oriented language.

Still, the syntax leaves something to be desired — why write vec_add (v1, v2) when you meant v1 + v2? And as soon as you want to use or reuse code in libraries, you run into another set of problems.
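In C++, the more natural spelling is within reach through operator overloading. What follows is only a sketch: Float4 is a hypothetical wrapper type of my own, and the scalar loop stands in for the single vec_add an Altivec build could use in its place.

```cpp
#include <cstddef>

// Hypothetical wrapper around four floats, the contents of one
// 128-bit vector register. Not part of the Altivec interface.
struct Float4 {
    float lane[4];
};

// Lets callers write v1 + v2. On an Altivec build this body could be
// one vec_add; a portable scalar loop is used here as a stand-in.
inline Float4 operator+(const Float4& a, const Float4& b) {
    Float4 r;
    for (std::size_t i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}
```

With this in place, Float4 v3 = v1 + v2; reads the way you meant it.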

You could use a set of C libraries; Apple, for example, provides vecLib for Mac OS X. But because the library is a separately compiled module, the function calls to it from your code screw up the otherwise straight line of execution and increase the number of redundant loads and stores, problems fatal to fast code.

So if the library offers simple functions, you can mix and match them flexibly, but your code pays for the extra calls, loads and stores, and slows down. On the other hand, if the library offers complex functions, their functionality may not be what you want.
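The trade-off is easier to see in code. In this sketch, lib_add and lib_mul are hypothetical stand-ins for simple library routines (they are not vecLib’s actual API), imagined as living in a separately compiled module. Composing them forces the intermediate result out to memory and back, while the hand-fused loop keeps it in a register.

```cpp
#include <cstddef>

// Hypothetical "simple" library routines; picture them compiled
// separately, so the caller's compiler cannot see inside them.
void lib_add(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void lib_mul(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i];
}

// Flexible, but every sum is stored to tmp by lib_add and then
// loaded again by lib_mul: two passes over memory, two calls.
void add_then_mul_composed(const float* a, const float* b, const float* c,
                           float* tmp, float* out, std::size_t n) {
    lib_add(a, b, tmp, n);
    lib_mul(tmp, c, out, n);
}

// One pass: the sum never leaves the CPU before it is multiplied.
void add_then_mul_fused(const float* a, const float* b, const float* c,
                        float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = (a[i] + b[i]) * c[i];
}
```

A library could ship the fused routine directly, but then it does exactly (a + b) * c and nothing else, which is the "complex functions" horn of the dilemma.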

I believe that is why more programmers aren’t into Altivec. The interface syntax isn’t as natural as it could be, and it’s really difficult to find a library that balances the needs of the processor and the needs of the human programmer.

» Altivec & Valarray: Flowing with the Go (II)