From anton Mon Aug 10 20:11:34 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References:
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Mon, 10 Aug 2015 17:49:19 GMT
Message-ID: <2015Aug10.194919@mips.complang.tuwien.ac.at>

William Edwards writes:
>Are interpreter loops still slow?
>https://hal.inria.fr/hal-01100647/document gives some interesting
>numbers and analysis into the state of modern branch predictors.
>Nice paper :)

I could only skim it up to now, but anyway, I just ran a few small
benchmarks on an Ivy Bridge (Sandy Bridge shrink; Core i3-3227U CPU @
1.90GHz), and got the following results:

[b8:~/gforth:8519] for i in 'gforth-fast --ss-number=0 --ss-states=0 --no-dynamic' 'gforth-fast --ss-number=0 --ss-states=0 --no-super' 'gforth-fast --ss-number=0 --ss-states=0'; do echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done
gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.261  0.325  0.202  0.474  0.235
31205591 branch-misses
gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.264  0.331  0.196  0.459  0.141
22584719 branch-misses
gforth-fast --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.194  0.268  0.115  0.271  0.073
13624653 branch-misses

The first is a threaded-code interpreter; the second employs
replication for improved branch prediction, and indeed, that does not
help for 4 out of 5 of these small benchmarks (but it gives a good
speedup on the fifth); there is also a reduction in branch misses
(but that is measured for all benchmarks together).  The third one
uses replication and dynamic superinstructions to get rid of a lot of
the indirect branches (and with them branch misses), and this
provides a good speedup on all of the benchmarks.

Here are some code samples, all for:

: foo 5 + ;
see-code foo

The threaded-code variant (--no-dynamic) decompiles as:

$7F61B00F2680 lit
$7F61B00F2688 <5>
$7F61B00F2690 +
$7F61B00F2698 ;s
 ok

All the VM instructions point to the original code (and that is not
disassembled by SEE-CODE), e.g.:

Code +
0x00000000004050c7: mov    %r15,%rax
0x00000000004050ca: lea    0x8(%r15),%r15
0x00000000004050ce: add    $0x8,%rbx
0x00000000004050d2: add    0x8(%rax),%r14
0x00000000004050d6: mov    -0x8(%rbx),%rdx
0x00000000004050da: mov    %rdx,%rax
0x00000000004050dd: jmpq   *%rax
end-code

The replication variant (--no-super) looks as follows:

$7FF540107680 lit
$7FF540107688 <5>
0x00007ff5418a6605: mov    %rbx,%rax
0x00007ff5418a6608: mov    %r14,(%r15)
0x00007ff5418a660b: lea    0x10(%rbx),%rbx
0x00007ff5418a660f: mov    (%rax),%r14
0x00007ff5418a6612: sub    $0x8,%r15
0x00007ff5418a6616: mov    -0x8(%rbx),%rdx
0x00007ff5418a661a: mov    %rdx,%rax
0x00007ff5418a661d: jmpq   *%rax
0x00007ff5418a661f: nop
$7FF540107690 +
0x00007ff5418a6620: mov    %r15,%rax
0x00007ff5418a6623: lea    0x8(%r15),%r15
0x00007ff5418a6627: add    $0x8,%rbx
0x00007ff5418a662b: add    0x8(%rax),%r14
0x00007ff5418a662f: mov    -0x8(%rbx),%rdx
0x00007ff5418a6633: mov    %rdx,%rax
0x00007ff5418a6636: jmpq   *%rax
$7FF540107698 ;s
 ok

It's the same code as for the threaded code, but each static
occurrence in the program gets a fresh copy of the code (and thus a
fresh branch predictor entry).  The decompiler does not show the code
for ;s, but that also has its own replica.
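To make the dispatch technique concrete, here is a minimal sketch of
direct-threaded dispatch in C, using the GNU C labels-as-values
extension; the three-instruction VM and all names in it are made up
for illustration, this is not Gforth's actual engine:

#include <stdio.h>

#define NEXT goto **ip++  /* dispatch: each handler ends with its own copy */

long interp(void **prog, long *sp, void ***labels_out)
{
    static void *labels[] = { &&lit, &&plus, &&semis };
    if (labels_out) { *labels_out = labels; return 0; }  /* export handlers */
    void **ip = prog;
    NEXT;
lit:   *++sp = (long)*ip++; NEXT;    /* push the inline literal */
plus:  sp[-1] += sp[0]; sp--; NEXT;  /* + */
semis: return *sp;                   /* ;s: return top of stack */
}

int main(void)
{
    long stack[16];
    void **h;                        /* handler addresses */
    interp(NULL, NULL, &h);
    /* threaded code for ": foo 5 + ;" applied to a 1 on the stack */
    void *foo[] = { h[0], (void *)5L, h[1], h[2] };
    stack[0] = 1;
    printf("%ld\n", interp(foo, stack, NULL));  /* prints 6 */
    return 0;
}

Replication then gives each static occurrence of a VM instruction its
own copy of such a handler (Gforth copies the machine code at load
time), so each copy's final indirect jump gets its own branch
predictor entry.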
The replication+dynamic superinstruction variant looks as follows:

$7F750C566680 lit
$7F750C566688 <5>
0x00007f750dde72f5: mov    %rbx,%rax
0x00007f750dde72f8: mov    %r14,(%r15)
0x00007f750dde72fb: lea    0x10(%rbx),%rbx
0x00007f750dde72ff: mov    (%rax),%r14
0x00007f750dde7302: sub    $0x8,%r15
$7F750C566690 +
0x00007f750dde7306: mov    %r15,%rax
0x00007f750dde7309: lea    0x8(%r15),%r15
0x00007f750dde730d: add    $0x8,%rbx
0x00007f750dde7311: add    0x8(%rax),%r14
$7F750C566698 ;s
 ok

Here the dispatch code from one VM instruction to the next is left
out, and each VM instruction falls through to the next VM
instruction, except for control-flow VM instructions.  Again the code
for ;s is not shown, but it is there.

Still to do: some benchmarks with one shared indirect branch
(switch-like).

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Tue Aug 11 11:50:52 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Tue, 11 Aug 2015 07:08:12 GMT
Message-ID: <2015Aug11.090812@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>William Edwards writes:
>>Are interpreter loops still slow?
>>https://hal.inria.fr/hal-01100647/document gives some interesting
>>numbers and analysis into the state of modern branch predictors.
>>Nice paper :)
>
>I could only skim it up to now, but anyway, I just ran a few small
>benchmarks on an Ivy Bridge (Sandy Bridge shrink; Core i3-3227U CPU @
>1.90GHz), and got the following results:
>
>[b8:~/gforth:8519] for i in 'gforth-fast --ss-number=0 --ss-states=0 --no-dynamic' 'gforth-fast --ss-number=0 --ss-states=0 --no-super' 'gforth-fast --ss-number=0 --ss-states=0'; do echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done
>gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
> sieve bubble matrix    fib    fft
> 0.261  0.325  0.202  0.474  0.235
>31205591 branch-misses
>gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
> sieve bubble matrix    fib    fft
> 0.264  0.331  0.196  0.459  0.141
>22584719 branch-misses
>gforth-fast --ss-number=0 --ss-states=0 onebench.fs
> sieve bubble matrix    fib    fft
> 0.194  0.268  0.115  0.271  0.073
>13624653 branch-misses
>
>The first is a threaded-code interpreter; the second employs
>replication for improved branch prediction, and indeed, that does not
>help for 4 out of 5 of these small benchmarks (but it gives a good
>speedup on the fifth); there is also a reduction in branch misses
>(but that is measured for all benchmarks together).  The third one
>uses replication and dynamic superinstructions to get rid of a lot of
>the indirect branches (and with them branch misses), and this
>provides a good speedup on all of the benchmarks.

Now I have rebuilt Gforth to have just one indirect branch in the
threaded-code version and a direct jump to it in the code for every
VM instruction, i.e., + looks as follows:

0x0000000000405154: mov    %r15,%rax
0x0000000000405157: lea    0x8(%r15),%r15
0x000000000040515b: add    $0x8,%rbx
0x000000000040515f: add    0x8(%rax),%r14
0x0000000000405163: mov    -0x8(%rbx),%rax
0x0000000000405167: jmpq   0x409926
...
0x0000000000409926: jmpq   *%rax

I.e., it is like a typical switch-based interpreter wrt the common
indirect branch (the switch-based interpreter has some additional
overhead for table lookup and range checking).  Here are the results:

gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.486  0.476  0.350  0.838  0.450
71396358 branch-misses
gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.287  0.349  0.187  0.480  0.160
25935934 branch-misses
gforth-fast --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.213  0.295  0.114  0.285  0.077
14570355 branch-misses

Results: This --no-dynamic version (with the shared indirect branch)
is significantly slower than, and has more than twice as many branch
mispredictions as, the --no-dynamic version with one indirect branch
per VM instruction.  So no, the branch predictor in the Ivy Bridge
does not make jumping to a shared indirect branch (and therefore the
usual implementation of switch-based interpreters) perform just as
well as threaded code.

The versions with dynamic replication (--no-super) and with dynamic
replication and dynamic superinstructions (no --no flag) do not jump
to the common indirect branch (at least not for relocatable VM
instruction code; they obviously have no other choice for
non-relocatable code), so the code looks and performs mostly the same
as for the earlier version.

These are small benchmarks, so I also ran the Forth appbench
benchmarks for hopefully more representative usage.  Run times in
seconds:

 shared non-shared
  --no-    --no-    --no-
dynamic  dynamic    super  default
  14.90     7.44     4.97     3.33  benchgc
   5.34     4.01     2.64     1.93  brainless
  14.89    11.92     7.17     5.42  brew
  14.43    12.99     8.68     7.61  cd16sim
  10.98     7.33     4.15     3.69  fcp
  61.28    23.79    23.25    17.95  lexex

For mispredictions the picture was as follows:

       shared  non-shared
        --no-       --no-        --no-
      dynamic     dynamic        super      default
  757,565,893 251,351,601   40,012,497   26,125,431  benchgc
  332,346,911 244,665,607   80,940,467   68,982,970  brainless
  843,106,659 676,036,927  197,102,010  149,515,029  brew
  897,932,005 857,944,972  339,798,626  333,013,682  cd16sim
  718,043,211 441,514,166  121,160,576  150,935,485  fcp
2,590,120,945  96,395,237   41,278,385   24,488,573  lexex

For some benchmarks, the differences between the shared and
non-shared indirect branch variants are small, for others (especially
lexex) huge (factor 2.47 in performance, factor 26 in branch
mispredictions).  Performing dynamic replication and dynamic
superinstructions pays off.

I also ran my threaded-code microbenchmark on the Ivy Bridge.
Results:

subrout direct indirect switch   call repl-sw
   0.05   0.30     0.34   0.51   0.38    0.13

Branch mispredictions:

for i in subroutine direitch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      4610 subroutine
   9539401 direct
  14614264 indirect
  20005474 switch
      5458 call
      4931 repl-switch

You can see that the performance is dominated by (indirect) branch
mispredictions.  This microbenchmark performs the following 10 VM
instructions in sequence and iterates 10M times:

noop1 noop2 noop1 noop3 noop1 noop4 noop1 noop5 noop1 loop

With threaded code and a BTB, one would expect a misprediction
whenever noop1 does its indirect branch to the next instruction, and
all the other indirect branches to predict correctly, resulting in a
50% miss rate (which was also observed for real programs).  A better
branch predictor does better (but in that case the performance of
this microbenchmark may be unrepresentative of the performance on
real code).
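For contrast, the shared-branch dispatch discussed above corresponds
to the classic switch-based interpreter loop; here is a minimal
sketch in C (the same made-up VM as before, not Gforth's code):

/* Sketch of switch-based dispatch: all VM instructions share one
   range-checked table lookup and one indirect branch, so the
   predictor sees the interleaved dispatch history of the whole
   program at a single branch. */
enum { LIT, PLUS, SEMIS };

long interp_switch(long *prog, long *sp)
{
    long *ip = prog;
    for (;;) {
        switch (*ip++) {        /* the one shared indirect branch */
        case LIT:   *++sp = *ip++; break;
        case PLUS:  sp[-1] += sp[0]; sp--; break;
        case SEMIS: return *sp;
        }
    }
}

With a plain BTB, the predicted target of that one branch is just the
handler of the previously dispatched instruction, which fits the much
higher miss counts measured for the shared-branch variants above.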
On Ivy Bridge I see 8-10% mispredictions for direct-threaded code,
14-16% for indirect-threaded code (I would have expected the same
number of mispredictions), and 20% mispredictions for switch.
Interestingly, the switch replication technique (where you have a
switch at the end of each VM instruction) predicts perfectly, and
therefore beats all the other interpreter techniques (I would have
expected prediction performance like direct or indirect threaded
code).  Also the call technique (which has one shared indirect call)
predicts perfectly for this benchmark (I would have expected as many
mispredictions as for switch).

My conclusion: Intel has improved indirect branch prediction, but
separate indirect branches still pay off (sometimes hugely), and
dynamic replication and dynamic superinstructions also pay off
nicely.  I guess this also holds true for Haswell, but we'll see when
there are experimental results.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Tue Aug 11 16:21:46 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Tue, 11 Aug 2015 14:08:37 GMT
Message-ID: <2015Aug11.160837@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I also ran my threaded-code microbenchmark on the Ivy Bridge.
>Results:
>
>subrout direct indirect switch   call repl-sw
>   0.05   0.30     0.34   0.51   0.38    0.13
>
>Branch mispredictions:
>
>for i in subroutine direitch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
>      4610 subroutine
>   9539401 direct
>  14614264 indirect
>  20005474 switch
>      5458 call
>      4931 repl-switch

I looked closer at the produced code and found that gcc-4.8.2 -O
produces a shared indirect jump that is jumped to from every VM
instruction here.  I fixed this by using -O3, and got the following
results:

subrout direct indirect switch   call repl-sw
   0.01   0.05     0.05   1.06   1.00    0.25

And mispredictions (this time with the correct command line):

for i in subroutine direct indirect switch call repl-switch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      3399 subroutine
      4556 direct
      4653 indirect
  60005523 switch
  60005667 call
      5207 repl-switch

This time there are 0% mispredictions for direct and indirect, and
60% mispredictions for switch and call.  It's strange that the
results for switch and call are worse than before.
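For reference, the switch replication (repl-switch) technique can be
sketched in C with a macro that expands a full dispatch switch at the
end of every handler, so each occurrence compiles to its own table
lookup and indirect branch; again a made-up VM for illustration, not
the actual microbenchmark code:

/* Sketch of switch replication: every VM instruction ends with its
   own copy of the dispatch switch (one NEXT expansion each), giving
   each dispatch site separate prediction state; a real opcode set
   would be large enough to make the compiler emit a jump table. */
enum { LIT, PLUS, SEMIS };

#define NEXT switch (*ip++) {            \
              case LIT:   goto do_lit;   \
              case PLUS:  goto do_plus;  \
              case SEMIS: goto do_semis; }

long interp_replsw(long *prog, long *sp)
{
    long *ip = prog;
    NEXT;
do_lit:   *++sp = *ip++;         NEXT;  /* push inline literal */
do_plus:  sp[-1] += sp[0]; sp--; NEXT;  /* + */
do_semis: return *sp;                   /* return top of stack */
}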
- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Mon Sep 7 17:15:03 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at> <4f9355a3-4f22-4428-9ef7-c5fb3e54a1fd@googlegroups.com>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Mon, 07 Sep 2015 12:25:07 GMT
Message-ID: <2015Sep7.142507@mips.complang.tuwien.ac.at>

Bruce Hoult writes:
>Do you not have a Haswell machine available? I'm sure someone will
>have if you have test binaries available.

I now have access to a Core i7-4790K (Haswell); here are the results
for the same things I have done before on the Ivy Bridge; Turbo Boost
was disabled on the Haswell (i.e., it ran at 4GHz).  Run times for
the individual benchmarks are in seconds of user time; branch misses
are reported for running them all.

for j in ' --ss-number=0 --ss-states=0 --no-dynamic' ' --ss-number=0 --ss-states=0 --no-super' ' --ss-number=0 --ss-states=0'; do for k in gforth-fast-4.8-PR15242 gforth-fast-4.8; do i=$k$j; echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done; done
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.264  0.196  0.132  0.228  0.088
13039125 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.112  0.152  0.092  0.164  0.060
11273919 branch-misses
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.120  0.160  0.088  0.156  0.052
11457651 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.112  0.152  0.088  0.152  0.056
11107473 branch-misses
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.096  0.128  0.052  0.100  0.032
8690210 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.088  0.120  0.052  0.088  0.036
7687416 branch-misses

gforth-fast-4.8-PR15242 is built with gcc-4.8 with one shared
indirect jump (that is jumped to with direct jumps, similar to what
switch-based interpreters usually generate), while gforth-fast-4.8
has a separate indirect jump at the end of each VM instruction
implementation; "--no-dynamic" (the first two runs) uses that
directly.  "--no-super" uses dynamic replication (and gives each
replicated implementation its own indirect branch); and the last two
runs use dynamic superinstructions with replication (also with a
separate indirect branch at the end of each dynamic
superinstruction).  See <2015Aug10.194919@mips.complang.tuwien.ac.at>
for a more detailed description.

The branch mispredictions are indeed not that much different between
the versions (all better than the best on the Ivy Bridge), but there
is still quite a bit of performance difference between the first two
(just from the extra direct jump); dynamic replication does not give
better branch prediction on the Haswell, and consequently no better
performance than gforth-fast-4.8 --no-dynamic.  But superinstructions
with replication do provide a nice speedup again, probably due to
fewer executed branches.
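The dynamic superinstruction construction described above can be
sketched as follows; this is a hypothetical, heavily simplified
version of the idea (all names are made up, error handling is
minimal, and a real engine such as Gforth must first check that each
handler's code is actually relocatable):

/* Hypothetical sketch of dynamic superinstruction construction:
   concatenate copies of the machine code of consecutive VM
   instruction handlers into executable memory, leaving out each
   handler's trailing dispatch code so the copies fall through to one
   another; only the end of the sequence gets a dispatch.  The start
   and length of each handler body would come from label arithmetic
   in the engine function; this only works for position-independent
   handler code, and W^X systems need an mprotect step instead of a
   writable+executable mapping. */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef struct {
    void   *start;     /* first byte of the handler's machine code      */
    size_t  body_len;  /* bytes up to, not including, the dispatch code */
    size_t  disp_len;  /* length of the dispatch code that follows      */
} handler;

/* build one superinstruction for the opcode sequence ops[0..n-1] */
void *make_superinst(const handler *h, const int *ops, int n)
{
    unsigned char *code = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return NULL;
    unsigned char *p = code;
    for (int i = 0; i < n; i++) {
        const handler *x = &h[ops[i]];
        memcpy(p, x->start, x->body_len);   /* body only: falls through */
        p += x->body_len;
        if (i == n - 1)                     /* append one final dispatch */
            memcpy(p, (unsigned char *)x->start + x->body_len, x->disp_len);
    }
    return code;                            /* new code address for the VM */
}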
So, if you conclude from the paper that optimizing the dispatch of
interpreters is no longer relevant, that's not even true on Haswell
(and certainly not on any other CPU).  On most benchmarks the last
version is more than twice as fast as the first one, and the first
one is probably faster than a switch-based interpreter.  But the
paper is correct in that the branch prediction of the Haswell is much
better than that of earlier CPUs.  The difference is most extreme for
gforth-fast-4.8-PR15242 --no-dynamic: 71M mispredictions on Ivy
Bridge, 13M on Haswell (Ivy Bridge results:
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Let's look at more substantial benchmarks, from the Forth appbench
suite (Ivy Bridge results in the same posting
<2015Aug11.090812@mips.complang.tuwien.ac.at>).  Run times in
seconds:

 shared non-shared
  --no-    --no-    --no-
dynamic  dynamic    super  default
  3.332    2.440    2.276    1.468  benchgc
  1.652    1.524    1.064    0.544  brainless
  4.016    3.416    2.288    1.520  brew
  3.420    3.236    2.232    1.232  cd16sim
  2.956    2.684    1.484    0.864  fcp
 13.128    9.848    9.280    7.840  lexex

Again, the interpreter optimizations give speedup factors of
1.67-3.42 from the most switch-like version to the Gforth default,
and there are also good speedups for separate indirect jumps over the
shared indirect jump.  For mispredictions the picture was as follows:

     shared  non-shared
      --no-       --no-       --no-
    dynamic     dynamic       super     default
 24,596,739  17,005,970  12,584,698   9,426,835  benchgc
163,953,064 178,234,917  51,375,412  18,391,920  brainless
281,856,378 279,851,195  47,928,843  21,390,209  brew
268,684,916 297,000,355  70,879,605  22,105,620  cd16sim
255,802,864 293,473,631  33,496,775  17,280,944  fcp
 39,426,063  12,267,065   8,269,104   6,885,712  lexex

We don't see a consistent benefit from the separate indirect jumps
here; replication does help a lot for a number of benchmarks, though,
and dynamic superinstructions with replication even more.  These
differences in branch prediction can explain the differences in run
time between the second and the third column.  So, even with
Haswell's branch predictor, there are significant branch prediction
benefits from replication, resulting in significant speedups.  I'll
leave the explanation of these results to the branch predictor
experts.  The difference from Ivy Bridge
<2015Aug11.090812@mips.complang.tuwien.ac.at> is quite big,
especially for the first lexex column: 2590M mispredictions on Ivy
Bridge, 39M on Haswell.

Finally, the results from my interpreter dispatch microbenchmark
(cf. <2015Aug11.160837@mips.complang.tuwien.ac.at> for Ivy Bridge
results):

subrout direct indirect switch   call repl-sw
   0.00   0.02     0.02   0.14   0.15    0.11

Branch mispredictions:

for i in subroutine direct indirect switch call repl-switch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      3835 subroutine
      3302 direct
      3863 indirect
      9083 switch
      4419 call
      6018 repl-switch

Branch prediction is close to perfect for all variants (there are
100M indirect branches/calls in each benchmark except subroutine).
There is still a speed difference, probably due to the additional
overhead of switch, call, and repl-switch: direct and indirect use 1
cycle/VM instruction, switch 5.8, call 6.1, and repl-switch 4.4.
It's interesting that, even without the branch prediction advantage
that was the motivation for repl-switch, repl-switch is still quite a
bit faster than switch (at least for this benchmark; for bigger
programs the larger cache load from the replicated switch tables
could become a problem).
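The cycles/VM-instruction figures follow from the run time, the 4GHz
clock, and the 100M dispatches per run; here is a quick
back-of-the-envelope check (the small deviations from the figures
above come from the run times being rounded to 10ms):

/* cycles per VM instruction = seconds * clock rate / dispatches */
#include <stdio.h>

int main(void)
{
    const double hz = 4e9, dispatches = 100e6;  /* 4GHz, 100M dispatches */
    const char  *name[] = { "direct", "indirect", "switch", "call", "repl-sw" };
    const double secs[] = {   0.02,     0.02,      0.14,     0.15,    0.11 };
    for (int i = 0; i < 5; i++)
        printf("%-8s %4.1f cycles/VM instruction\n",
               name[i], secs[i] * hz / dispatches);
    return 0;
}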
My conclusion: Haswell's indirect branch prediction is really much
better than before, but for larger programs running on an interpreter
like Gforth, replication still provides substantial branch prediction
improvements that result in significant speedups.  And while there is
no longer a branch prediction advantage to keeping separate indirect
branches for dispatch, there is still a significant speed advantage;
and dynamic superinstructions also provide a good speedup, resulting
in overall speedups by factors 1.67-3.42 for the application
benchmarks.

Why are the results here different from those in the paper?  1) A
different interpreter.  2) Different benchmarks.

If you write an interpreter and look for performance, should you go
for interpreter optimizations like threaded code, replication, and
dynamic superinstructions, as I suggest, or just use a switch-based
interpreter, as the paper suggests?  Threaded code is a good idea
with little cost in any case.  If that provides a significant speedup
and your VM instruction implementations are short (you run into cache
trouble with long VM instructions [vitale&abdelrahman04]), then
replication with superinstructions will probably give a good speedup.

@InProceedings{vitale&abdelrahman04,
  author   = {Benjamin Vitale and Tarek S. Abdelrahman},
  title    = {Catenation and Specialization for {Tcl} Virtual Machine
              Performance},
  crossref = {ivme04},
  pages    = {42--50},
  year     = {2004}
}

@Proceedings{ivme04,
  booktitle = {IVME '04 Proceedings},
  title     = {IVME '04 Proceedings},
  year      = {2004}
}

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html