compiler | gcc 3.3 | CW 9.3 | VC++ 2003 | VC++ 2003 |
---|---|---|---|---|
os | OS X 10.3.6 | OS X 10.3.6 | WinXP SP2 | WinXP SP2 |
cpu | G5 2.0 GHz | G5 2.0 GHz | P4 2.8 GHz | PM 1.6 GHz |
multiply add | 2000 | 2083 | 538 | 704 |
inner product | 943 | 943 | 641 | 1220 |
polynomial | 676 | 704 | 321 | 311 |
hypotenuse | 347 | 370 | 214 | 131 |
complex multiply add | 485 | --- | --- | --- |
predicate | 2500 | 1220 | --- | --- |
slicing | 275 | 238 | 118 | 78 |
power | 60.5 | 50.4 | --- | --- |
trigonometric | 39.8 | 71.7 | 29.1 | 19.8 |
macstl 0.2 and the fighting compilers are back after a hiatus of over a year. For your viewing pleasure, we have cleaned out the ring, trained up the incumbent and brought in two new contenders, the everyman Pentium 4 at 2.8 GHz and the new kid Pentium M (Centrino) at 1.6 GHz, to challenge our dual PowerPC G5 at 2.0 GHz.
The benchmarks are all-new over the ones featured in macstl 0.1.5. They target significantly longer expressions such as multiply-adds, polynomials and trigonometric functions: the kind of expressions you'd use in real life. All are tuned to be single-threaded, live within L2 cache and have denormal handling off, so that neither the second processor of a dual machine nor slow main memory skews the results. We've also set compiler options to the highest optimization levels, including strict aliasing and loop unrolling.
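To give a feel for what these tests exercise, here is a minimal sketch of three of the expressions next to the scalar loop you would otherwise write by hand. It uses `std::valarray` as a stand-in for the library's array type, and the exact formulas, function names and coefficients are illustrative assumptions rather than the benchmark's actual source.

```cpp
#include <cmath>
#include <cstddef>
#include <valarray>

// Hand-coded scalar multiply-add: the kind of loop a vectorizing
// library is measured against. (Hypothetical helper name.)
void multiply_add_scalar(const float* a, const float* b, const float* c,
                         float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}

// The same multiply-add written as one whole-array expression.
std::valarray<float> multiply_add(const std::valarray<float>& a,
                                  const std::valarray<float>& b,
                                  const std::valarray<float>& c) {
    return a * b + c;
}

// A cubic polynomial in Horner form: ((k3*x + k2)*x + k1)*x + k0.
// The degree and coefficients are assumptions for illustration.
std::valarray<float> polynomial(const std::valarray<float>& x,
                                float k3, float k2, float k1, float k0) {
    return ((k3 * x + k2) * x + k1) * x + k0;
}

// Hypotenuse of two arrays of side lengths: sqrt(a*a + b*b).
std::valarray<float> hypotenuse(const std::valarray<float>& a,
                                const std::valarray<float>& b) {
    std::valarray<float> sum_of_squares = a * a + b * b;
    return std::sqrt(sum_of_squares);
}
```

To stay inside L2 cache as the tests do, you would size the arrays so that all operands fit together; a few thousand floats per array is comfortably small on the CPUs tested here.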
We're also measuring speed-up over hand-coded scalar loops, which tells you directly how much benefit you'd get out of macstl on your platform. This test will be handy for seeing how we fare against auto-vectorizing compilers, if and when they become generally available.
compiler | gcc 3.3 | CW 9.3 | VC++ 2003 | VC++ 2003 |
---|---|---|---|---|
os | OS X 10.3.6 | OS X 10.3.6 | WinXP SP2 | WinXP SP2 |
cpu | G5 2.0 GHz | G5 2.0 GHz | P4 2.8 GHz | PM 1.6 GHz |
multiply add | 3.5 | 3.6 | 1.2 | 2.4 |
inner product | 2.8 | 2.8 | 3.0 | 4.1 |
polynomial | 2.3 | 3.2 | 1.1 | 1.4 |
hypotenuse | 4.1 | 6.8 | 4.7 | 5.2 |
complex multiply add | 3.1 | --- | --- | --- |
predicate | 3.5 | 2.2 | --- | --- |
slicing | 0.84 | 0.75 | 0.33 | 0.51 |
power | 6.7 | 5.4 | --- | --- |
trigonometric | 11.8 | 16.1 | 9.6 | 3.6 |
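The speed-up figures above are ratios, so a value below 1 (as in the slicing row) means the hand-coded loop was faster. As a rough sketch of how such a ratio can be measured, assuming it is scalar time divided by vectorized time, the following times a multiply-add both ways. `std::valarray` stands in for the library's array type, and `seconds_for`, `speed_up` and `measure_multiply_add_speed_up` are hypothetical helper names, not part of macstl or its benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <valarray>
#include <vector>

// Run `fn` `reps` times and return the total elapsed wall time in seconds.
template <class Fn>
double seconds_for(Fn fn, int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        fn();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Assumed definition of speed-up: scalar time over vectorized time.
// Greater than 1 means the array expression won; less than 1, the loop won.
double speed_up(double scalar_seconds, double vector_seconds) {
    return scalar_seconds / vector_seconds;
}

// Time out = a * b + c as a hand-coded loop and as an array expression.
// Requires n > 0. Single-threaded; keep n small enough to fit in L2 cache.
double measure_multiply_add_speed_up(std::size_t n, int reps) {
    std::valarray<float> a(1.5f, n), b(2.0f, n), c(0.25f, n), out(n);
    std::vector<float> scalar_out(n);
    volatile float sink = 0.0f;  // discourages the optimizer from dropping work

    double scalar_seconds = seconds_for([&] {
        for (std::size_t i = 0; i < n; ++i)
            scalar_out[i] = a[i] * b[i] + c[i];
        sink = scalar_out[n - 1];
    }, reps);

    double vector_seconds = seconds_for([&] {
        out = a * b + c;
        sink = out[n - 1];
    }, reps);

    return speed_up(scalar_seconds, vector_seconds);
}
```

A call such as `measure_multiply_add_speed_up(4096, 100000)` gives one ratio for your own compiler and CPU. Timings this short are noisy, so it is worth repeating the measurement a few times before reading much into it.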
The speed crown is won deservedly by CodeWarrior 9.3 on the PowerPC G5. Note how on the complicated expressions like polynomial (704 vs. 676), hypotenuse (370 vs. 347) and trigonometric (71.7 vs. 39.8), the number of operations is higher than with gcc 3.3. The macstl-generated code also shows the greatest speed-up over scalar loops on most of the tests.
The G5 roundly trashes both the Pentium 4 and the Pentium M, despite running at a lower clock rate than the Pentium 4. So much for the MHz myth. I put this down to the abundance of registers available in the PowerPC ISA and the proper design of the AltiVec unit, which lets SIMD calculations run at full speed on the CPU rather than being hampered by loads from cache or memory.
Interestingly enough, the Pentium M holds its own against the faster-clocked Pentium 4, especially with the simpler expressions, a result of Intel's redesign of the architecture and cache. This contributes to its overall win on inner product, the test that makes the least use of store opcodes.
Clearly, macstl will accelerate your code on all sorts of compilers, operating systems and CPUs. Only the slicing test showed an actual slowdown over writing your own loops, while the trigonometric test showed a speed-up of 3.6x to 16.1x. So why don't you download macstl, run the benchmark on your own system and see if it's worth the sticker price!