Pentium vs G5

Thousand Operations per Second (greater = faster)
compiler	gcc 3.3	CW 9.3	VC++ 2003	VC++ 2003
os	OS X 10.3.6	OS X 10.3.6	WinXP SP2	WinXP SP2
cpu	G5 2.0	G5 2.0	P4 2.8	PM 1.6
multiply add	2000	2083	538	704
inner product	943	943	641	1220
polynomial	676	704	321	311
hypotenuse	347	370	214	131
complex multiply add	485	---	---	---
predicate	2500	1220	---	---
slicing	275	238	118	78
power	60.5	50.4	---	---
trigonometric	39.8	71.7	29.1	19.8

macstl 0.2 and the fighting compilers are back after a hiatus of over a year. For your viewing pleasure, we have cleaned out the ring, trained up the incumbent and brought in two new contenders, the everyman Pentium 4 at 2.8 GHz and the new kid Pentium M (Centrino) at 1.6 GHz, to challenge our dual PowerPC G5 at 2.0 GHz.

The wrestling ring

The benchmarks are all-new over the ones featured in macstl 0.1.5. They target significantly longer expressions such as multiply-adds, polynomials and trigonometric functions — the kind of expressions you’d use in real life. All are tuned to be single-threaded, live within L2 cache and have denormal handling off, minimizing the skew of fast dual processors and slow main memory. We’ve also set compiler options to the highest optimization levels, including strict aliasing and loop unrolling.

We’re also measuring speed-up over hand-coded scalar loops, which tells you directly how much benefit you’d get out of macstl on your platform. This test will be handy for seeing how we fare against auto-vectorizing compilers, if and when they become generally available.
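To make the comparison concrete, here is a minimal sketch of the two styles being timed, using multiply add as the example. It is not the actual benchmark code: `std::valarray` stands in for a vectorized array type, and the names `multiply_add_scalar`, `multiply_add_expr`, `operations_per_second` and `speed_up` are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <valarray>

// Hand-coded scalar loop: the baseline that the speed-up is measured against.
void multiply_add_scalar(const float* a, const float* b, const float* c,
                         float* result, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        result[i] = a[i] * b[i] + c[i];
}

// The same multiply add written as a single whole-array expression.
// std::valarray is a stand-in here (assumption), not a SIMD-accelerated type.
std::valarray<float> multiply_add_expr(const std::valarray<float>& a,
                                       const std::valarray<float>& b,
                                       const std::valarray<float>& c)
{
    return a * b + c;
}

// Runs op() `repeats` times and returns completed operations per second.
template <typename Op>
double operations_per_second(Op op, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        op();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return repeats / elapsed.count();
}

// "Times faster than scalar loops": the ratio of the two throughputs.
double speed_up(double expr_ops_per_second, double scalar_ops_per_second)
{
    assert(scalar_ops_per_second > 0.0);
    return expr_ops_per_second / scalar_ops_per_second;
}
```

A real harness would also need to stop the optimizer from discarding unused results and keep the arrays small enough to stay within L2 cache, as described above.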

Times Faster than Scalar Loops (greater = faster)
compiler	gcc 3.3	CW 9.3	VC++ 2003	VC++ 2003
os	OS X 10.3.6	OS X 10.3.6	WinXP SP2	WinXP SP2
cpu	G5 2.0	G5 2.0	P4 2.8	PM 1.6
multiply add	3.5	3.6	1.2	2.4
inner product	2.8	2.8	3.0	4.1
polynomial	2.3	3.2	1.1	1.4
hypotenuse	4.1	6.8	4.7	5.2
complex multiply add	3.1	---	---	---
predicate	3.5	2.2	---	---
slicing	0.84	0.75	0.33	0.51
power	6.7	5.4	---	---
trigonometric	11.8	16.1	9.6	3.6

The commentary

The speed crown is deservedly won by CodeWarrior 9.3 on the PowerPC G5. Note how on complicated expressions like polynomial, hypotenuse and trigonometric, it completes more operations per second than gcc 3.3, by about 80% on trigonometric (71.7 vs 39.8). On those same tests, the CodeWarrior build also shows the greatest speed-up over scalar loops of any configuration (3.2x, 6.8x and 16.1x).

The G5 roundly thrashes both the Pentium 4 and the Pentium M, despite running at a lower clock rate than the Pentium 4. So much for the MHz myth. I put this down to the abundance of registers available in the PowerPC ISA and the proper design of the AltiVec unit, allowing SIMD calculations to run at full speed on the CPU rather than being hampered by loads from cache or memory.
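One way to check the clock-rate point is to divide each throughput figure by the clock rate. The sketch below does this for the multiply add row of the first table. The `Result` struct and the `per_ghz` and `print_per_clock` helpers are hypothetical names introduced for illustration, and thousand operations per second per GHz is only a rough yardstick, not a measurement from the benchmark itself.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark figure together with the clock rate of the CPU it ran on.
struct Result {
    const char* system;
    double kops_per_second;  // thousand operations per second, from the table
    double clock_ghz;        // CPU clock rate in GHz
};

// Throughput normalized by clock rate (thousand operations per second per GHz).
double per_ghz(const Result& r)
{
    assert(r.clock_ghz > 0.0);
    return r.kops_per_second / r.clock_ghz;
}

// The "multiply add" row of the first table.
const Result multiply_add[] = {
    {"G5 2.0 / gcc 3.3",   2000.0, 2.0},
    {"G5 2.0 / CW 9.3",    2083.0, 2.0},
    {"P4 2.8 / VC++ 2003",  538.0, 2.8},
    {"PM 1.6 / VC++ 2003",  704.0, 1.6},
};

// Prints one line per system with its per-GHz throughput.
void print_per_clock()
{
    for (const Result& r : multiply_add)
        std::printf("%-20s %8.1f\n", r.system, per_ghz(r));
}
```

On these figures the G5 gets through roughly five times as much multiply add work per GHz as the Pentium 4, and the Pentium M more than twice as much as the Pentium 4.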

Interestingly enough, the Pentium M holds its own against the faster-clocked Pentium 4, especially with the simpler expressions, a result of Intel’s redesign of the architecture and cache. This factors into its overall win for inner product, which has the least use of store opcodes.

Clearly, macstl will accelerate your code on all sorts of compilers, operating systems and CPUs. Only the slicing test showed an actual slowdown over writing your own loops, while the trigonometric test showed a speed-up of 3.6x to 16.1x over your own loops. So why don’t you download macstl, run the benchmark on your own system and see if it’s worth the sticker price!
