On RISCs the Gforth engine is very close to optimal; i.e., it is usually impossible to write a significantly faster engine.
On register-starved machines like the 386 architecture processors,
improvements are possible, because gcc does not utilize the registers as
well as a human, even with explicit register declarations; e.g., Bernd
Beuster wrote a Forth system fragment in assembly language and
hand-tuned it for the 486; this system is 1.19 times faster on the Sieve
benchmark on a 486DX2/66 than Gforth compiled with gcc-2.6.3 with
-DFORCE_REG.
However, this potential advantage of assembly language implementations
is not necessarily realized in complete Forth systems: We compared
Gforth (direct threaded, compiled with gcc-2.6.3 and -DFORCE_REG) with
Win32Forth 1.2093, LMI's NT Forth (Beta, May 1994) and Eforth (with and
without peephole (aka pinhole) optimization of the threaded code); all
these systems were written in assembly language. We also compared Gforth
with three systems written in C: PFE-0.9.14 (compiled with gcc-2.6.3
with the default configuration for Linux: -O2 -fomit-frame-pointer
-DUSE_REGS -DUNROLL_NEXT), ThisForth Beta (compiled with gcc-2.6.3 -O3
-fomit-frame-pointer; ThisForth employs peephole optimization of the
threaded code) and TILE (compiled with make opt). We benchmarked
Gforth, PFE, ThisForth and TILE on a 486DX2/66 under Linux. Kenneth
O'Heskin kindly provided the results for Win32Forth and NT Forth on a
486DX2/66 with similar memory performance under Windows NT. Marcel
Hendrix ported Eforth to Linux, then extended it to run the benchmarks,
added the peephole optimizer, ran the benchmarks and reported the
results.
We used four small benchmarks: the ubiquitous Sieve; bubble-sorting and
matrix multiplication, which come from the Stanford integer benchmarks
and were translated into Forth by Martin Fraeman (we used the versions
included in the TILE Forth package, but with bigger data set sizes); and
a recursive Fibonacci number computation for benchmarking calling
performance. The following table shows the time taken for the benchmarks
scaled by the time taken by Gforth (in other words, it shows the speedup
factor that Gforth achieved over the other systems).

relative      Win32-    NT       eforth       This-
  time  Gforth Forth Forth eforth  +opt   PFE Forth  TILE
sieve     1.00  1.39  1.14   1.39  0.85  1.58  3.18  8.58
bubble    1.00  1.31  1.41   1.48  0.88  1.50        3.88
matmul    1.00  1.47  1.35   1.46  0.74  1.58        4.09
fib       1.00  1.52  1.34   1.22  0.86  1.74  2.99  4.30

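To make the scaling used in the table concrete, here is a minimal sketch in C. The function name relative_time and the raw timings in the note below are hypothetical; the original raw times are not given in this section.

```c
#include <assert.h>

/* Relative time as used in the table: the time another system needed
   for a benchmark, divided by the time Gforth needed for the same
   benchmark.  The name relative_time is an assumption made for this
   illustration. */
double relative_time(double other_seconds, double gforth_seconds)
{
    assert(gforth_seconds > 0.0);
    return other_seconds / gforth_seconds;
}
```

For example, if Gforth needed a hypothetical 10.0 seconds and another system 13.9 seconds, relative_time(13.9, 10.0) is about 1.39, i.e., Gforth is about 1.39 times faster. A value below 1.00, as in the eforth +opt column, means that the other system was faster than Gforth.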
You may find the good performance of Gforth compared with the systems
written in assembly language quite surprising. One important reason for
the disappointing performance of these systems is probably that they are
not written optimally for the 486 (e.g., they use the lods instruction).
In addition, Win32Forth uses a comfortable, but costly method for
relocating the Forth image: like cforth, it computes
the actual addresses at run time, resulting in two address computations
per NEXT (see section Image File Background).
Only Eforth with the peephole optimizer performs comparably to Gforth. The speedups achieved with peephole optimization of threaded code are quite remarkable. Adding a peephole optimizer to Gforth should cause similar speedups.
The speedup of Gforth over PFE, ThisForth and TILE is easily explained by the self-imposed restriction of the latter systems to standard C, which makes efficient threading impossible (however, the measured implementation of PFE uses a GNU C extension: section `Defining Global Register Variables' in GNU C Manual). Moreover, current C compilers have a hard time optimizing other aspects of the ThisForth and the TILE source.
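The difference between dispatch in standard C and threaded-code dispatch with GNU C can be illustrated with a small sketch. It is not taken from Gforth, PFE, ThisForth or TILE: the three-primitive virtual machine, the names run_switch and run_threaded, and the fixed buffer sizes are assumptions made for this example. The second function relies on the GNU C `Labels as Values' extension (&&label and goto *) and therefore needs gcc or a compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-primitive virtual machine, for illustration only. */
enum { OP_LIT, OP_ADD, OP_HALT };

typedef union { void *addr; long n; } Cell;

/* Standard C: all primitives share one dispatch point (the switch). */
long run_switch(const long *prog)
{
    long stack[16], *sp = stack;
    for (;;) {
        switch (*prog++) {
        case OP_LIT:  *sp++ = *prog++;       break;
        case OP_ADD:  sp--; sp[-1] += sp[0]; break;
        case OP_HALT: return sp[-1];
        default:      return -1;
        }
    }
}

/* GNU C: every primitive ends with its own indirect jump (NEXT). */
long run_threaded(const long *prog, size_t len)
{
    static void *const prims[] = { &&do_lit, &&do_add, &&do_halt };
    Cell code[64], *ip = code;
    long stack[16], *sp = stack;
    size_t i;

    assert(len <= 64);
    /* Translate opcodes into label addresses (the threaded code). */
    for (i = 0; i < len; i++) {
        long op = prog[i];
        assert(op >= OP_LIT && op <= OP_HALT);
        code[i].addr = prims[op];
        if (op == OP_LIT && i + 1 < len) {
            i++;
            code[i].n = prog[i];   /* inline literal operand */
        }
    }
#define NEXT goto *(ip++)->addr
    NEXT;
do_lit:  *sp++ = (ip++)->n;     NEXT;
do_add:  sp--; sp[-1] += sp[0]; NEXT;
do_halt: return sp[-1];
#undef NEXT
}
```

With the program { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_HALT }, both functions return 5. In run_switch every primitive first returns to the loop and is then dispatched through the switch, whereas in run_threaded each primitive jumps directly to the code of the next one. The sketch omits stack overflow checks.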
Note that the performance of Gforth on 386 architecture processors
varies widely with the version of gcc used. E.g., gcc-2.5.8 failed to
allocate any of the virtual machine registers into real
machine registers by itself and would not work correctly with explicit
register declarations, giving a 1.3 times slower engine (on a 486DX2/66
running the Sieve) than the one measured above.
Note also that there have been several releases of Win32Forth since the release presented here, so the results presented here may have little predictive value for the performance of Win32Forth today.
In Translating Forth to Efficient C by M. Anton Ertl and Martin
Maierhofer (presented at EuroForth '95), an indirect threaded version of
Gforth is compared with Win32Forth, NT Forth, PFE, and ThisForth; that
version of Gforth is 2%-8% slower on a 486 than the direct
threaded version used here. The paper is available at
`http://www.complang.tuwien.ac.at/papers/ertl&maierhofer95.ps.gz';
it also contains numbers for some native code systems. You can find a
newer version of these measurements at
`http://www.complang.tuwien.ac.at/forth/performance.html'. You can
find numbers for Gforth on various machines in `Benchres'.