Compilation of Stack-Based Languages, status April 1999

This project was supported by the FWF for two years starting in February 1996 (research project P11231). Since February 1998, the project has stalled, but I expect to continue it some time in the future.

Goal

The goal of the project Compilation of Stack-Based Languages is to advance the state of the art in compiling languages like Forth, Postscript and JavaVM to fast machine code. These languages have properties that make traditional compilation techniques badly suited for them:

Stack: items have to be allocated effectively to registers to get fast machine code.
Very fast compilation: is required, because of the interactive nature (Forth, Postscript), or because of the compile-and-run infrastructure (JavaVM just-in-time compilation).

Current Results

We currently have a working compiler from Forth to machine code, written in Forth:

It produces executable machine code for the MIPS architecture.
It can compile itself, repeatedly.
It runs on Gforth, an interpretive (direct threaded code) implementation of ANS Forth. The native code compiler is fully integrated with the interpreter, i.e., the interpreter can call native code and vice versa.
It uses tree parsing (of DAGs) for instruction selection, backwards list scheduling for instruction scheduling, and a simple basic block register allocation algorithm (with a new, simple, but effective register shuffling algorithm). Implemented naively, these phases would require nine passes; we integrate them into four passes, for increased compilation speed.
We use a linear intermediate representation that is quickly executable, because it is threaded code [bell73,kogge82]; this representation is ideal for code replication optimizations like inlining.

The main results up to now are:

The speedup over the interpreted Gforth is 3.32-4.46 for small benchmarks and 1.41 for the self-compilation benchmark. The self-compilation speedup is a little disappointing; we think that the main reason for this is that even the RAFTS-compiled version of RAFTS executes a significant number of threaded-code words.
Compilation of Forth via C [ertl&maierhofer95] gives code that is 0.89-1.97 times faster than the code produced by our native code compiler. The lower performance of our native code compiler on most benchmarks is caused by the lack of interprocedural register allocation combined with very short basic blocks (which are due to a very high call frequency).
The compilation speed is currently 2.9 times slower than for Gforth's Forth-to-threaded-code compiler. A slowdown is to be expected, because RAFTS has to perform the same I/O and dictionary lookup tasks as Gforth (and currently most of these tasks actually use the same threaded code), but has a much more complicated and time-consuming code generator; still, we think that compilation speed can be improved.

Publications

Our final report [ertl&pirker98] is a short paper written for a computer science audience. For more specific information, read our EuroForth papers [ertl&pirker96, ertl&pirker97] (but they assume a Forth background). There is also a position paper that gives an overview of the ideas [ertl92]. Finally, my dissertation [ertl96diss] discusses RAFTS and other Forth implementation techniques.

Future Work

Interprocedural register allocation: We will probably first do a standard graph colouring register allocator. It may be too slow, so we also think about a fast register allocation method that exploits the nice liveness properties of stack items. While this method may not produce as good code as established graph colouring techniques, we hope that the improved compile time will be worth this price.
Code replication: The main purpose of code replication is to produce longer basic blocks, that allow our basic block code generation techniques to be more effective. Currently the plan is to combine a conservative full inliner with an aggressive partial inliner and a control flow join delaying optimization. This should be pretty easy with our new linear intermediate representation.
Tail-call optimization: turns a call followed by an exit into a jump. It should work even better in combination with partial inlining (because partial inlining turns some calls into tail-calls).

References

[bell73] James R. Bell.: Threaded code. Communications of the ACM, 16 no. 6 pp. 370-372, 1973.
[ertl92] M. Anton Ertl.: A new approach to Forth native code generation. In EuroForth '92, pages 73-78, Southampton, England, 1992. MicroProcessor Engineering.
[ertl96diss] M. Anton Ertl.: Implementation of Stack-Based Languages on Register Machines. Dissertation, Technische Universit\"at Wien, Austria, 1996.
[ertl&maierhofer95] M. Anton Ertl and Martin Maierhofer.: Translating Forth to efficient C. In EuroForth '95 Conference Proceedings, Schlo\ss Dagstuhl, Germany, 1995.
[ertl&pirker96] M. Anton Ertl and Christian Pirker.: RAFTS for basic blocks: A progress report on Forth native code compilation. In EuroForth '96 Conference Proceedings, St. Petersburg, Russia, 1996.
[ertl&pirker97] M. Anton Ertl and Christian Pirker.: The structure of a Forth native code compiler. In EuroForth '97 Conference Proceedings, pages 107-116, Oxford, 1997.
[ertl&pirker98] M. Anton Ertl and Christian Pirker.: Compilation of stack-based languages (abschlu\ssbericht). Final report to fwf for research project p11231, Institut f\"ur Computersprachen, Technische Universit\"at Wien, 1998.
[kogge82] Peter M. Kogge.: An architectural trail to threaded-code systems. Computer, pages 22-32, March 1982.

Anton Ertl