What is threaded code? What is it good for? What are the differences between the various threading techniques? How do I implement threaded code portably? How fast are various threading techniques?
Threaded code, in its original meaning [bell73], is one of the techniques for implementing virtual machine interpreters. Nowadays, at least the Forth community uses the term threading for almost any technique used for implementing Forth's virtual machine.
Consider a sequence of virtual machine instructions A, B, and C. One way to execute it is to compile it into a sequence of calls to the routines implementing these instructions:

call A
call B
call C

This is known as subroutine-threaded code, although it is not threaded code in the original sense; in fact, subroutine threading is not even an interpretive technique.
Now, let's eliminate the calls:
A
B
C

i.e., a sequence of code addresses that represent the virtual machine instructions.
As a consequence, we cannot execute this piece of code by jumping to its start; instead, we have to keep a pointer to the current instruction in a separate register (rather than using the processor's program counter register and return address stack/register).
How do we execute the next instruction? Let's assume that the instruction pointer register (ip) always points to the word in the code sequence that directly follows the current instruction word. Then we just have to load the word, jump to the routine pointed to by it, and increment the instruction pointer. E.g., in MIPS assembly language:
lw   $2,0($4)   #get next instruction, $4=inst.ptr.
addu $4,$4,4    #advance instruction pointer
j    $2         #execute next instruction
#nop            #branch delay slot

This routine is the interpreter (aka inner interpreter in the Forth community). It is also known as the NEXT routine. Either every virtual machine instruction routine has a copy of NEXT at the end, or they share one copy of NEXT and jump to it. With modern processors, the shared NEXT not only costs a jump, but also dramatically increases the misprediction rate of the indirect jump on CPUs with BTBs and similar indirect branch predictors [ertl&gregg03jilp]. Therefore, I recommend not sharing NEXT.
The method described above is the one described in [bell73] as threaded code and has later been called direct threaded code.
Note that, in contrast to popular myths, subroutine threading is usually slower than direct threading. But it is a starting point for native code generation.
Let us consider constants: in direct threaded code, a constant can be represented by the virtual machine instruction lit (for literal), followed by the value of the constant. If the constant occurs frequently, it is more space-efficient to have a dedicated virtual machine instruction for this constant. If we have several such constants, the code for their virtual machine instructions will look very similar. So, we may want to share the code, while having separate data.
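For illustration, here is how the two representations might look in direct threaded code (the names lit and const42 are made up for this example):

lit       #address of the routine that pushes the following cell
42        #the value of the constant, fetched by lit

versus, for a frequently occurring constant:

const42   #address of a routine that pushes 42; no operand cell needed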
We can do this by adding a level of indirection in NEXT (resulting in indirect threaded code). Each word (a generalization of the virtual machine instruction) has a code field and a parameter field. E.g., in the case of the constant:
code-field:      docon   #code address of shared CONstant code
parameter-field: value   #value of constant

The virtual machine code is now represented by a sequence of code field addresses, not code addresses. Simple virtual machine instructions (primitives) are typically represented like this:
code-field2:      parameter-field2
parameter-field2: code   #machine code of the primitive

The NEXT of indirect threaded code looks like this in MIPS assembly language:
lw   $2,0($4)   #get the code-field address of the next word, $4=inst.ptr.
#nop            #load delay slot
lw   $3,0($2)   #get the code address of the word
addu $4,$4,4    #advance instruction pointer
j    $3         #execute code for the word
#nop            #branch delay slot

The code for the next instruction can compute the parameter-field address from the code-field address in $2.
(Note: On the Pentium, the K5, and the K6(-2), mixing code and data is extremely expensive, so this form of direct threading is much slower on these processors than indirect threading.)
Here I present variants corresponding to direct threading. Add an indirection in NEXT to get the indirect threaded version.
Real threaded code requires first-class labels (code addresses as data), which ANSI C lacks; the usual portable fallback is a giant switch statement (see switch threading below).
In GNU C, a direct threaded NEXT looks like this:
typedef void *Inst;
Inst *ip;   /* you should use a local variable for this */
#define NEXT goto **ip++
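For illustration, here is a minimal engine sketch using this NEXT (the instructions add and halt and the stack layout are made up for this example; GNU C's && operator takes the address of a label):

long engine(long *sp)
{
  static Inst program[] = { &&add, &&halt };   /* direct threaded code */
  Inst *ip = program;                          /* local, as recommended */

  NEXT;               /* dispatch the first instruction */
add:
  sp[1] += sp[0];     /* example primitive: add the two top-of-stack cells */
  sp++;
  NEXT;               /* this routine's own copy of NEXT */
halt:
  return sp[0];
}

With long stack[] = {3, 4};, engine(stack) returns 7. Because NEXT is a macro, every instruction routine that ends in NEXT automatically gets its own copy, giving the separate indirect jumps recommended above. For the indirect threaded version, declare ip as Inst ** and use, e.g., #define NEXT do {Inst *w = *ip++; goto **w;} while (0).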
Another portable variant passes the virtual machine registers as arguments and has each instruction routine call the routine for the next instruction at its end (relying on the compiler to optimize these tail calls):

typedef void (* Inst)();

void inst1(Inst *ip, /* other regs */)
{
  ...
  (*ip)(ip+1, /* other registers */);
}
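A minimal sketch of how such an interpreter might be started (halt, add_one, acc, and the program are made up for this example, and for simplicity only ip is passed; without tail-call optimization, every executed instruction consumes stack space):

long acc = 0;              /* a made-up virtual machine register */

void halt(Inst *ip)
{
  /* return instead of dispatching further, ending interpretation */
}

void add_one(Inst *ip)     /* hypothetical instruction: increment acc */
{
  acc++;
  (*ip)(ip+1);             /* dispatch the next instruction */
}

static Inst program[] = { (Inst)add_one, (Inst)add_one, (Inst)halt };

void engine()
{
  (*program)(program+1);   /* start with the first instruction; acc ends at 2 */
}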
Switch threading dispatches the virtual machine instructions with a giant switch statement (or similar statements in other languages, e.g., Pascal's CASE). The result has the same advantages as token threaded code; the disadvantage over token threaded code is the lower speed due to the range check generated by most (all?) C compilers for the switch; moreover, with the right encoding, token threaded code can avoid scaling, while the C compilers I have seen do not avoid it.
The main performance drawback of switch threading on modern CPUs is that it uses one shared indirect branch; this leads to close to 100% mispredictions in the BTB (branch target buffer) used for indirect branch prediction on many modern CPUs (e.g., Athlon 64, Pentium 4, PPC 970), whereas threaded code with separate NEXTs has only about 50% mispredictions [ertl&gregg03jilp].
Switch threading in C looks like this:
typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };
  Inst *ip = program;

  for (;;)
    switch (*ip++) {
    case inst1:
      ...
      break;
    ...
    }
}
The separate-NEXT approach, with its reduction in BTB mispredictions and the resulting performance gain, can be implemented with switch, too, by replicating the switch, as follows:
typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };
  Inst *ip = program;

  switch (*ip++) {
  case inst1: goto label1;
  ...
  }

label1:
  ...
  switch (*ip++) {
  case inst1: goto label1;
  ...
  }
  ...
}
While this technique gives the same branch prediction advantages as
the separate NEXTs in threaded code, it does not eliminate the other
performance disadvantages of switch threading.
In call threading, the virtual machine code is a sequence of function pointers, and a central dispatch loop calls the routine for each instruction in turn; this costs a call and a return per instruction. Moreover, each virtual machine instruction is represented by a function (procedure). This is very elegant to look at, and allows separate compilation, but it means that global variables have to be used for the virtual machine registers (e.g., the instruction pointer ip), which most (all?) compilers allocate into memory; in contrast, with switch threading you can use local variables, which are (hopefully) allocated into registers.
Call threading looks like this:
typedef void (* Inst)();

Inst *ip;   /* global virtual machine register: most compilers allocate it into memory */

void inst1()
{
  ...
}

void engine()
{
  for (;;)
    (*ip++)();   /* call the routine for the next virtual machine instruction */
}
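A sketch of how this interpreter might be set up and started (the program and the halt instruction are made up for this example; halt stops the otherwise endless dispatch loop by calling exit()):

#include <stdlib.h>   /* for exit() */

void halt()           /* hypothetical instruction: stop the interpreter */
{
  exit(0);
}

static Inst program[] = { inst1, halt };

int main()
{
  ip = program;       /* point the global instruction pointer at the threaded code */
  engine();           /* calls inst1, then halt, which exits */
  return 0;
}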
Note: Call threading is a term I have used for this
technique. There seems to be no established term yet. I have seen
subroutine threading used for this technique, but that term
has a well-established and different meaning (see above).
Interestingly, the system described in [moore&leach70] (before Moore worked at NRAO) does not use threaded code, but string interpretation. However, it uses a code field, i.e., it has the seed for indirect threading; then again, code fields were folklore at that time [kay96].
@Article{bell73,
author = "James R. Bell",
title = "Threaded Code",
journal = {Communications of the ACM},
year = 1973,
volume = 16,
number = 6,
pages = "370--372"
}
@Article{dewar75,
author = {Robert B.K. Dewar},
title = {Indirect Threaded Code},
journal = {Communications of the ACM},
year = {1975},
volume = {18},
number = {6},
month = jun,
pages = {330--331}
}
@string{ieeecomputer="Computer"}
@Article{kogge82,
author = "Peter M. Kogge",
title = "An Architectural Trail to Threaded-Code Systems",
journal = ieeecomputer,
year = "1982",
pages = "22--32",
month = mar,
annote = "Explains the design of (a classical
implementation of) Forth, starting with threaded
code, then adding the parameter stack, constants,
variables, control structures, dictionary, outer
interpreter and compiler."
}
@InProceedings{ertl93,
author = "M. Anton Ertl",
title = "A Portable {Forth} Engine",
booktitle = "EuroFORTH '93 conference proceedings",
year = "1993",
address = "Mari\'ansk\'e L\'azn\`e (Marienbad)",
url = "http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z",
abstract = "The Forth engine discussed in this paper is written
in GNU C, which provides several extensions that are
important for Forth implementation. The indirect
threaded Forth engine is completely
machine-independent, direct threading requires a few
machine-specific lines for each machine. Using
a portable language like GNU C encourages producing
an engine with many primitives. In order to make the
development of primitives easier and less
error-prone, an automatic tool generates most of the
code for a Forth primitive from the stack effect
notation, even if the TOS is kept in a register. The
engine is combined with the parts of the system
written in Forth by loading a machine-independent
image file that contains the executable Forth code
in relocatable form."
}
@InProceedings{ertl95pldi,
author = "M. Anton Ertl",
title = "Stack Caching for Interpreters",
crossref = "sigplan95",
pages = "315--327",
url = "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
abstract = "An interpreter can spend a significant part of its
execution time on arguments of virtual machine
instructions. This paper explores two methods to
reduce this overhead for virtual stack machines by
caching top-of-stack values in (real machine)
registers. The {\em dynamic method} is based on
having, for every possible state of the cache, one
specialized version of the whole interpreter; the
execution of an instruction usually changes the
state of the cache and the next instruction is
executed in the version corresponding to the new
state. In the {\em static method} a state machine
that keeps track of the cache state is added to the
compiler. Common instructions exist in specialized
versions for several states, but it is not necessary
to have a version of every instruction for every
cache state. Stack manipulation instructions are
optimized away."
}
@Proceedings{sigplan95,
booktitle = "SIGPLAN '95 Conference on Programming Language
Design and Implementation",
title = "SIGPLAN '95 Conference on Programming Language
Design and Implementation",
year = "1995",
key = "SIGPLAN '95"
}
@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
note = {http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}
@TechReport{moore&leach70,
author = "Charles H. Moore and Geoffrey C. Leach",
title = "FORTH -- A Language for Interactive Computing",
institution = "Mohasco Industries, Inc.",
year = "1970",
address = "Amsterdam, NY",
url = "http://www.dnai.com/~jfox/F70POST.ZIP"
}
@Article{moore87,
author = {Charles Moore},
title = {Forth -- eine persönliche Sprache},
journal = {Vierte Dimension},
year = {1987},
volume = {3},
number = {3},
month = oct,
pages = {11--13},
note = {Translated into German by Klaus Schleisiek, the
original is probably in More on NC4000},
}
@InProceedings{kay96,
author = "Alan C. Kay",
title = "The Early History of Smalltalk",
crossref = "hopl2",
pages = "511--579",
}
@Proceedings{hopl2,
title = {History of Programming Languages},
booktitle = {History of Programming Languages},
year = 1996,
key = {HOPL-II},
publisher = {ACM Press/Addison-Wesley}
}