cycles per NEXT (i.e., runtime/(1e8*cycletime))

| Machine | Processor | gcc version | subroutine direct indir. switch call anti-btb |
|---|---|---|---|
| DECstation 5000/125 | R3000 25MHz | 2.2.2 | 4.95 4.425 6.45 11.625 |
| DECstation 5000/150 | R4000 100MHz | 2.4.5 | 9.7 7.5 10.5 16.8 |
| SGI PowerChallenge XL | R10000 195MHz | 2.7.2 | 16.4 8.31 10.3 13.3 |
| HP/Apollo 425 | 68040 25MHz | 2.2.2 | 9.575+ 5.575 7.525 15.875 |
| HP/Apollo 720 | HP-PA 50MHz | 2.3.2 | 7.77 5.495 7.525 9.87 |
| Sun Ultra 1 | UltraSPARC 143MHz | 2.8.1 | 9.86 11.43 13.29 17.29 |
| UltraSPARC 30 | UltraSPARC II 248MHz | 2.7.2.3 | 8.08 11.33 13.49 17.51 |
| AlphaPC 64 | 21064A 300MHz | egcs-1.0.3 | 12.27$ 9.12 12.57 18.15 19.74 |
| Alpha 164LX | 21164A 600MHz | egcs-1.0.3 | 7.8$ 7.8 10.32 10.86 |
| Compaq XP1000 | 21264 500MHz | egcs-1.0.3 | 9.4$ 17.25$ 17.25$ 9.55 |
| PowerMac | PPC604e 200MHz | 2.7.2.1 | 5.74 7.24 9.34 12.4 |
| Powerbook G3 | PPC750 266MHz | egcs-1.1.2 | 4.24 5.36 7.36 11.73 11.68 |
| 486 | 486DX2 50MHz | 2.2.2d | 10.15* 7.2 7.3 10.75 |
| Pentium PB Cache | Pentium 133MHz | 2.6.3 | 8.93*$ 3.73 4.73 17.52$ |
| IBM/Cyrix 6x86-P166+ | IBM6x86 133MHz | 2.7.2.1* | 5.48 5.71^ 7.29 7.6 |
| Dell XPS Pro200n | PentiumPro 200MHz | 2.7.2% | 6.66 5.52 6.54 15.56 |
| Mendocino | Celeron 333MHz | 2.7.2.3* | 4.8 5.8 6.7 7.3 11.2 |
| AMD K6-2 Super 7 | K6-2 300MHz | 2.7.2.3 | 4.23* 7.32 10.08 48.57$ |
| Thunderbird/VIA KT133 | Athlon 800MHz | 2.95.1* | 21.44$ 5.36 6.08 7.2 10.48 12.8 |
| Northwood/i845E | Pentium4 2.26GHz | 2.95.3* | 8.6 10.2 10.7 11.4 20.6 22.9 |

+manually unoptimized to become realistic
&gcc version cygnus-2.7-96q4
*with -fomit-frame-pointer
%gcc version cygnus-2.7.2-960712
^compiled with gcc-2.6.3 (bug in RedHat's 2.7.2.1)
$see text below
Note that a microbenchmark like this can uncover some problems, but not necessarily all of them, and cannot be used to predict the performance of real applications. This is particularly true for the more modern processors; read the notes below.
Thanks to Bernd Paysan for the values on the SPARCStation and the HP 700. Bernd does not know whether the SPARCStation is a 1 or 2; I guess from its slowness that it's a SPARCStation 1. Thanks to Franz Puntigam for the values on the 486. Thanks to Dominique de Waleffe for the PentiumPro numbers. Thanks to Thomas Gschwind for the 21164 and the IBM6x86 numbers. Thanks to Bernd Beuster for the results on the UltraSPARC 30.
All times are user time. The assembly code generated by the GNU C Compiler was inspected and found realistic, with one exception: I had to unoptimize the assembly code for subroutine threading on the 68040, since the compiler allocated the address of the function "next" to a register.
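The unit used in the table, cycles per NEXT, follows from the measured user time via runtime/(1e8*cycletime). A minimal sketch of that conversion is given here; the function and parameter names are hypothetical and not taken from the benchmark sources.

```c
/* Number of NEXTs executed by one benchmark run (from the text). */
#define NEXTS_PER_RUN 1e8

/* cycles per NEXT = runtime / (1e8 * cycletime), where cycletime is the
   reciprocal of the clock rate.  user_seconds and clock_hz are
   hypothetical parameter names chosen for this sketch. */
double cycles_per_next(double user_seconds, double clock_hz)
{
    double cycle_time = 1.0 / clock_hz;   /* seconds per clock cycle */
    return user_seconds / (NEXTS_PER_RUN * cycle_time);
}
```

As a worked example, a hypothetical run taking 19.8 s of user time on a 25MHz machine comes out at 19.8/(1e8*40ns) = 4.95 cycles per NEXT.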
The benchmark consists of a loop that contains nine NEXTs and a looping instruction (a termination test and a jump back for subroutine threaded code), i.e. it primarily measures NEXT speed. This loop is executed 10,000,000 times (resulting in 100,000,000 NEXTs and a bit of overhead). It fits completely into the respective caches.
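The loop described above can be sketched for direct threading roughly as follows. This is an illustration in GNU C (it relies on the labels-as-values extension), not the actual direct.c; the names run_direct, prog, w1..w9 and the counter are assumptions made for this sketch.

```c
/* GNU C only: uses the labels-as-values extension (&&label, goto *ptr). */
#define NEXT goto **ip++   /* fetch the next code address and jump to it */

/* Runs the threaded loop `iterations` times and returns the number of
   words executed (one NEXT dispatch each).  Hypothetical sketch. */
long run_direct(long iterations)
{
    /* Threaded code: nine do-nothing words followed by the looping word. */
    static void *prog[] = { &&w1, &&w2, &&w3, &&w4, &&w5,
                            &&w6, &&w7, &&w8, &&w9, &&loop };
    void **ip = prog;
    long nexts = 0;   /* bookkeeping for this sketch only; drop it when timing */

    NEXT;
w1: nexts++; NEXT;
w2: nexts++; NEXT;
w3: nexts++; NEXT;
w4: nexts++; NEXT;
w5: nexts++; NEXT;
w6: nexts++; NEXT;
w7: nexts++; NEXT;
w8: nexts++; NEXT;
w9: nexts++; NEXT;
loop:
    nexts++;
    if (--iterations <= 0)   /* termination test */
        return nexts;
    ip = prog;               /* jump back to the start of the threaded code */
    NEXT;
}
```

Calling run_direct(10000000) executes 100,000,000 dispatches, matching the counts given above. As with the real benchmark, the generated assembly should be inspected, since an optimizing compiler may merge the identical words.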
The older processors are relatively simple and perform quite predictably. For the newer ones, the performance on this benchmark is determined very much by microarchitectural properties like branch target buffers, return stacks, branch mispredict penalties, and cache consistency algorithms. Here are some notes on specific processors:
main() is put after the other routines. Using the restricted convention gives significant speedups: 5.97 cycles on the 21064a, 5.22 cycles on the 21164a, 4.6 cycles on the 21264. Which convention is more appropriate depends on the system (for RAFTS/Alpha we will probably use the restricted convention).
The Alpha architecture allows static branch prediction. Our direct and indirect threading benchmarks work with 0% prediction accuracy on the Alpha. 90% accuracy makes the results faster by about 4.5 cycles. In real Forth code, we can achieve 33% prediction accuracy for direct threading and 40% for indirect threading (see http://www.complang.tuwien.ac.at/forth/peep/).
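To get a rough feeling for what the real-code accuracies might be worth, one can interpolate between the two data points given above (0% accuracy, and 90% accuracy saving about 4.5 cycles). The sketch below assumes that the saving scales linearly with prediction accuracy; that linearity is an assumption of this example, not a measured result, and the function name is hypothetical.

```c
/* Estimated cycles saved per NEXT at a given static prediction accuracy.
   Assumption (not measured): the saving scales linearly between the 0%
   case and the 90% case quoted in the text (about 4.5 cycles). */
double estimated_saving(double accuracy)
{
    const double ref_accuracy = 0.90;  /* accuracy of the reference point */
    const double ref_saving   = 4.5;   /* cycles saved at that accuracy */
    return ref_saving * accuracy / ref_accuracy;
}
```

Under this assumption, 33% accuracy would save roughly 1.65 cycles per NEXT and 40% accuracy roughly 2 cycles; actual results may differ.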
The Pentium subroutine threading results also have a peculiarity: they are something of a worst case for the branch target buffer (which is used for predicting the target of the RET). A variant that implements the best case for branch prediction (every call calls a different next()) runs in 3.76 cycles, about as fast as direct threading. In practice, the branch prediction accuracy will be somewhere in between (probably closer to the best case).
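The difference between the two cases can be illustrated with the following GNU C sketch. It is not the actual subroutine.c or its best-case variant; the names run_shared, run_distinct, next1..next9 and the call counter are assumptions made for illustration.

```c
/* GNU C attribute: keep every call a real CALL/RET pair. */
#define NOINLINE __attribute__((noinline))

/* Volatile counter so the calls are not optimized away; bookkeeping for
   this sketch only, not part of a timing run. */
static volatile long calls;

/* Worst case: a single next(), whose one RET returns to nine different
   call sites in every iteration. */
NOINLINE static void next(void) { calls++; }

long run_shared(long iterations)
{
    calls = 0;
    while (iterations-- > 0) {
        next(); next(); next(); next(); next();
        next(); next(); next(); next();
    }
    return calls;
}

/* Best case: nine distinct routines, so each RET always returns to the
   same call site. */
#define DEF_NEXT(n) NOINLINE static void next##n(void) { calls++; }
DEF_NEXT(1) DEF_NEXT(2) DEF_NEXT(3) DEF_NEXT(4) DEF_NEXT(5)
DEF_NEXT(6) DEF_NEXT(7) DEF_NEXT(8) DEF_NEXT(9)

long run_distinct(long iterations)
{
    calls = 0;
    while (iterations-- > 0) {
        next1(); next2(); next3(); next4(); next5();
        next6(); next7(); next8(); next9();
    }
    return calls;
}
```

Both functions execute nine calls per iteration and differ only in how many distinct return addresses each RET sees. An optimizer may fold the identical routines into one, so the generated assembly should be checked before timing either variant.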
| Name | Last modified | Size | Description |
|---|---|---|---|
| subroutine.c | 10-Jul-1992 08:55 | 211 | |
| pro-btb.c | 15-Apr-2004 15:55 | 492 | |
| indirect.c | 09-Jul-1992 17:07 | 320 | |
| direct.c | 09-Jul-1992 14:37 | 259 | |
| case.c | 03-Feb-1993 15:23 | 579 | switch threading |
| call.c | 20-Apr-1999 19:51 | 309 | |
| anti-btb.c | 27-Nov-1998 16:00 | 436 | direct.c modified to have only 10% BTB hit rate |