From anton Mon Aug 10 20:11:34 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References:
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Mon, 10 Aug 2015 17:49:19 GMT
Message-ID: <2015Aug10.194919@mips.complang.tuwien.ac.at>

William Edwards writes:
>Are interpreter loops still slow?
>https://hal.inria.fr/hal-01100647/document gives some interesting
>numbers and analysis into the state of modern branch predictors.
>Nice paper :)

I could only skim it up to now, but anyway, I just ran a few small
benchmarks on an Ivy Bridge (Sandy Bridge shrink; Core i3-3227U CPU @
1.90GHz), and got the following results:

[b8:~/gforth:8519] for i in 'gforth-fast --ss-number=0 --ss-states=0 --no-dynamic' 'gforth-fast --ss-number=0 --ss-states=0 --no-super' 'gforth-fast --ss-number=0 --ss-states=0'; do echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done
gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.261  0.325  0.202  0.474  0.235
31205591 branch-misses
gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.264  0.331  0.196  0.459  0.141
22584719 branch-misses
gforth-fast --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.194  0.268  0.115  0.271  0.073
13624653 branch-misses

The first is a threaded-code interpreter; the second employs
replication for improved branch prediction, and indeed, that does not
help for 4 out of 5 of these small benchmarks (but it gives a good
speedup on the fifth); there is also a reduction in branch misses
(but that is measured for all benchmarks together).  The third one
uses replication and dynamic superinstructions to get rid of a lot of
the indirect branches (and with them branch misses), and this
provides a good speedup on all of the benchmarks.

Here are some code samples, all for:

: foo 5 + ;
see-code foo

The threaded-code variant (--no-dynamic) decompiles as:

$7F61B00F2680 lit
$7F61B00F2688 <5>
$7F61B00F2690 +
$7F61B00F2698 ;s
 ok

All the VM instructions point to the original code (and that is not
disassembled by SEE-CODE), e.g.:

Code +
0x00000000004050c7: mov    %r15,%rax
0x00000000004050ca: lea    0x8(%r15),%r15
0x00000000004050ce: add    $0x8,%rbx
0x00000000004050d2: add    0x8(%rax),%r14
0x00000000004050d6: mov    -0x8(%rbx),%rdx
0x00000000004050da: mov    %rdx,%rax
0x00000000004050dd: jmpq   *%rax
end-code

The replication variant (--no-super) looks as follows:

$7FF540107680 lit
$7FF540107688 <5>
0x00007ff5418a6605: mov    %rbx,%rax
0x00007ff5418a6608: mov    %r14,(%r15)
0x00007ff5418a660b: lea    0x10(%rbx),%rbx
0x00007ff5418a660f: mov    (%rax),%r14
0x00007ff5418a6612: sub    $0x8,%r15
0x00007ff5418a6616: mov    -0x8(%rbx),%rdx
0x00007ff5418a661a: mov    %rdx,%rax
0x00007ff5418a661d: jmpq   *%rax
0x00007ff5418a661f: nop
$7FF540107690 +
0x00007ff5418a6620: mov    %r15,%rax
0x00007ff5418a6623: lea    0x8(%r15),%r15
0x00007ff5418a6627: add    $0x8,%rbx
0x00007ff5418a662b: add    0x8(%rax),%r14
0x00007ff5418a662f: mov    -0x8(%rbx),%rdx
0x00007ff5418a6633: mov    %rdx,%rax
0x00007ff5418a6636: jmpq   *%rax
$7FF540107698 ;s
 ok

It's the same code as for the threaded code, but each static
occurrence in the program gets a fresh copy of the code (and thus a
fresh branch predictor entry).  The decompiler does not show the code
for ;s, but that also has its own replica.
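To make the dispatch technique concrete, here is a minimal sketch of
direct-threaded dispatch in C, using the GNU C labels-as-values
extension; the three-instruction VM and all names in it are made up
for illustration, this is not Gforth's actual engine:

#include <stdio.h>

#define NEXT goto **ip++  /* dispatch: each handler ends with its own copy */

long interp(void **prog, long *sp, void ***labels_out)
{
    static void *labels[] = { &&lit, &&plus, &&semis };
    if (labels_out) { *labels_out = labels; return 0; }  /* export handlers */
    void **ip = prog;
    NEXT;
lit:   *++sp = (long)*ip++; NEXT;    /* push the inline literal */
plus:  sp[-1] += sp[0]; sp--; NEXT;  /* + */
semis: return *sp;                   /* ;s: return top of stack */
}

int main(void)
{
    long stack[16];
    void **h;                        /* handler addresses */
    interp(NULL, NULL, &h);
    /* threaded code for ": foo 5 + ;" applied to a 1 on the stack */
    void *foo[] = { h[0], (void *)5L, h[1], h[2] };
    stack[0] = 1;
    printf("%ld\n", interp(foo, stack, NULL));  /* prints 6 */
    return 0;
}

Replication then gives each static occurrence of a VM instruction its
own copy of such a handler (Gforth copies the machine code at load
time), so each copy's final indirect jump gets its own branch
predictor entry.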
The replication+dynamic superinstruction variant looks as follows:

$7F750C566680 lit
$7F750C566688 <5>
0x00007f750dde72f5: mov    %rbx,%rax
0x00007f750dde72f8: mov    %r14,(%r15)
0x00007f750dde72fb: lea    0x10(%rbx),%rbx
0x00007f750dde72ff: mov    (%rax),%r14
0x00007f750dde7302: sub    $0x8,%r15
$7F750C566690 +
0x00007f750dde7306: mov    %r15,%rax
0x00007f750dde7309: lea    0x8(%r15),%r15
0x00007f750dde730d: add    $0x8,%rbx
0x00007f750dde7311: add    0x8(%rax),%r14
$7F750C566698 ;s
 ok

Here the dispatch code from one VM instruction to the next is left
out, and each VM instruction falls through to the next VM
instruction, except for control-flow VM instructions.  Again the code
for ;s is not shown, but it is there.

Still to do: some benchmarks with one shared indirect branch
(switch-like).

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Tue Aug 11 11:50:52 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Tue, 11 Aug 2015 07:08:12 GMT
Message-ID: <2015Aug11.090812@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>William Edwards writes:
>>Are interpreter loops still slow?
>>https://hal.inria.fr/hal-01100647/document gives some interesting
>>numbers and analysis into the state of modern branch predictors.
>>Nice paper :)
>
>I could only skim it up to now, but anyway, I just ran a few small
>benchmarks on an Ivy Bridge (Sandy Bridge shrink; Core i3-3227U CPU @
>1.90GHz), and got the following results:
>
>[b8:~/gforth:8519] for i in 'gforth-fast --ss-number=0 --ss-states=0 --no-dynamic' 'gforth-fast --ss-number=0 --ss-states=0 --no-super' 'gforth-fast --ss-number=0 --ss-states=0'; do echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done
>gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
> sieve bubble matrix    fib    fft
> 0.261  0.325  0.202  0.474  0.235
>31205591 branch-misses
>gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
> sieve bubble matrix    fib    fft
> 0.264  0.331  0.196  0.459  0.141
>22584719 branch-misses
>gforth-fast --ss-number=0 --ss-states=0 onebench.fs
> sieve bubble matrix    fib    fft
> 0.194  0.268  0.115  0.271  0.073
>13624653 branch-misses
>
>The first is a threaded-code interpreter; the second employs
>replication for improved branch prediction, and indeed, that does not
>help for 4 out of 5 of these small benchmarks (but it gives a good
>speedup on the fifth); there is also a reduction in branch misses
>(but that is measured for all benchmarks together).  The third one
>uses replication and dynamic superinstructions to get rid of a lot of
>the indirect branches (and with them branch misses), and this
>provides a good speedup on all of the benchmarks.

Now I have rebuilt Gforth to have just one indirect branch in the
threaded-code version and a direct jump to it in the code for every
VM instruction, i.e., + looks as follows:

0x0000000000405154: mov    %r15,%rax
0x0000000000405157: lea    0x8(%r15),%r15
0x000000000040515b: add    $0x8,%rbx
0x000000000040515f: add    0x8(%rax),%r14
0x0000000000405163: mov    -0x8(%rbx),%rax
0x0000000000405167: jmpq   0x409926
...
0x0000000000409926: jmpq   *%rax

I.e., it is like a typical switch-based interpreter wrt the common
indirect branch (the switch-based interpreter has some additional
overhead for table lookup and range checking).  Here are the results:

gforth-fast --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.486  0.476  0.350  0.838  0.450
71396358 branch-misses
gforth-fast --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.287  0.349  0.187  0.480  0.160
25935934 branch-misses
gforth-fast --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.213  0.295  0.114  0.285  0.077
14570355 branch-misses

Results: This --no-dynamic version (with the shared indirect branch)
is significantly slower than, and has more than twice as many branch
mispredictions as, the --no-dynamic version with one indirect branch
per VM instruction.  So no, the branch predictor in the Ivy Bridge
does not make jumping to a shared indirect branch (and therefore the
usual implementation of switch-based interpreters) perform just as
well as threaded code.

The versions with dynamic replication (--no-super) and with dynamic
replication and dynamic superinstructions (no --no flag) do not jump
to the common indirect branch (at least not for relocatable VM
instruction code; they obviously have no other choice for
non-relocatable code), so the code looks and performs mostly the same
as for the earlier version.

These are small benchmarks, so I also ran the Forth appbench
benchmarks for hopefully more representative usage.  Run times in
seconds:

 shared non-shared
  --no-    --no-    --no-
dynamic  dynamic    super  default
  14.90     7.44     4.97     3.33  benchgc
   5.34     4.01     2.64     1.93  brainless
  14.89    11.92     7.17     5.42  brew
  14.43    12.99     8.68     7.61  cd16sim
  10.98     7.33     4.15     3.69  fcp
  61.28    23.79    23.25    17.95  lexex

For mispredictions the picture was as follows:

       shared  non-shared
        --no-       --no-        --no-
      dynamic     dynamic        super      default
  757,565,893 251,351,601   40,012,497   26,125,431  benchgc
  332,346,911 244,665,607   80,940,467   68,982,970  brainless
  843,106,659 676,036,927  197,102,010  149,515,029  brew
  897,932,005 857,944,972  339,798,626  333,013,682  cd16sim
  718,043,211 441,514,166  121,160,576  150,935,485  fcp
2,590,120,945  96,395,237   41,278,385   24,488,573  lexex

For some benchmarks, the differences between the shared and
non-shared indirect branch variants are small, for others (especially
lexex) huge (factor 2.47 in performance, factor 26 in branch
mispredictions).  Performing dynamic replication and dynamic
superinstructions pays off.

I also ran my threaded-code microbenchmark on the Ivy Bridge.
Results:

subrout direct indirect switch   call repl-sw
   0.05   0.30     0.34   0.51   0.38    0.13

Branch mispredictions:

for i in subroutine direitch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      4610 subroutine
   9539401 direct
  14614264 indirect
  20005474 switch
      5458 call
      4931 repl-switch

You can see that the performance is dominated by (indirect) branch
mispredictions.  This microbenchmark performs the following 10 VM
instructions in sequence and iterates 10M times:

noop1 noop2 noop1 noop3 noop1 noop4 noop1 noop5 noop1 loop

With threaded code and a BTB, one would expect a misprediction
whenever noop1 does its indirect branch to the next instruction, and
all the other indirect branches to predict correctly, resulting in a
50% miss rate (which was also observed for real programs).  A better
branch predictor does better (but in that case the performance of
this microbenchmark may be unrepresentative of the performance on
real code).
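For contrast, the shared-branch dispatch discussed above corresponds
to the classic switch-based interpreter loop; here is a minimal
sketch in C (the same made-up VM as before, not Gforth's code):

/* Sketch of switch-based dispatch: all VM instructions share one
   range-checked table lookup and one indirect branch, so the
   predictor sees the interleaved dispatch history of the whole
   program at a single branch. */
enum { LIT, PLUS, SEMIS };

long interp_switch(long *prog, long *sp)
{
    long *ip = prog;
    for (;;) {
        switch (*ip++) {        /* the one shared indirect branch */
        case LIT:   *++sp = *ip++; break;
        case PLUS:  sp[-1] += sp[0]; sp--; break;
        case SEMIS: return *sp;
        }
    }
}

With a plain BTB, the predicted target of that one branch is just the
handler of the previously dispatched instruction, which fits the much
higher miss counts measured for the shared-branch variants above.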
On Ivy Bridge I see 8-10% mispredictions for direct-threaded code,
14-16% for indirect-threaded code (I would have expected the same
number of mispredictions), and 20% mispredictions for switch.
Interestingly, the switch replication technique (where you have a
switch at the end of each VM instruction) predicts perfectly, and
therefore beats all the other interpreter techniques (I would have
expected prediction performance like direct or indirect threaded
code).  Also the call technique (which has one shared indirect call)
predicts perfectly for this benchmark (I would have expected as many
mispredictions as for switch).

My conclusion: Intel has improved indirect branch prediction, but
separate indirect branches still pay off (sometimes hugely), and
dynamic replication and dynamic superinstructions also pay off
nicely.  I guess this also holds true for Haswell, but we'll see when
there are experimental results.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Tue Aug 11 16:21:46 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Tue, 11 Aug 2015 14:08:37 GMT
Message-ID: <2015Aug11.160837@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I also ran my threaded-code microbenchmark on the Ivy Bridge.
>Results:
>
>subrout direct indirect switch   call repl-sw
>   0.05   0.30     0.34   0.51   0.38    0.13
>
>Branch mispredictions:
>
>for i in subroutine direitch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
>      4610 subroutine
>   9539401 direct
>  14614264 indirect
>  20005474 switch
>      5458 call
>      4931 repl-switch

I looked closer at the produced code and found that gcc-4.8.2 -O
produces a shared indirect jump that is jumped to from every VM
instruction here.  I fixed this by using -O3, and got the following
results:

subrout direct indirect switch   call repl-sw
   0.01   0.05     0.05   1.06   1.00    0.25

And mispredictions (this time with the correct command line):

for i in subroutine direct indirect switch call repl-switch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      3399 subroutine
      4556 direct
      4653 indirect
  60005523 switch
  60005667 call
      5207 repl-switch

This time there are 0% mispredictions for direct and indirect, and
60% mispredictions for switch and call.  It's strange that the
results for switch and call are worse than before.
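For reference, the switch replication (repl-switch) technique can be
sketched in C with a macro that expands a full dispatch switch at the
end of every handler, so each occurrence compiles to its own table
lookup and indirect branch; again a made-up VM for illustration, not
the actual microbenchmark code:

/* Sketch of switch replication: every VM instruction ends with its
   own copy of the dispatch switch (one NEXT expansion each), giving
   each dispatch site separate prediction state; a real opcode set
   would be large enough to make the compiler emit a jump table. */
enum { LIT, PLUS, SEMIS };

#define NEXT switch (*ip++) {            \
              case LIT:   goto do_lit;   \
              case PLUS:  goto do_plus;  \
              case SEMIS: goto do_semis; }

long interp_replsw(long *prog, long *sp)
{
    long *ip = prog;
    NEXT;
do_lit:   *++sp = *ip++;         NEXT;  /* push inline literal */
do_plus:  sp[-1] += sp[0]; sp--; NEXT;  /* + */
do_semis: return *sp;                   /* return top of stack */
}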
- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From anton Mon Sep 7 17:15:03 2015
X-newsreader: xrn 10.00-beta-3
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Interpreter loops vs branch predictors
Path: mips.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at> <4f9355a3-4f22-4428-9ef7-c5fb3e54a1fd@googlegroups.com>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Mon, 07 Sep 2015 12:25:07 GMT
Message-ID: <2015Sep7.142507@mips.complang.tuwien.ac.at>

Bruce Hoult writes:
>Do you not have a Haswell machine available? I'm sure someone will
>have if you have test binaries available.

I now have access to a Core i7-4790K (Haswell); here are the results
for the same things I have done before on the Ivy Bridge; Turbo Boost
was disabled on the Haswell (i.e., it ran at 4GHz).  Run times for
the individual benchmarks are in seconds of user time; branch misses
are reported for running them all.

for j in ' --ss-number=0 --ss-states=0 --no-dynamic' ' --ss-number=0 --ss-states=0 --no-super' ' --ss-number=0 --ss-states=0'; do for k in gforth-fast-4.8-PR15242 gforth-fast-4.8; do i=$k$j; echo $i onebench.fs; perf stat -x " " -e branch-misses $i onebench.fs; done; done
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.264  0.196  0.132  0.228  0.088
13039125 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
 sieve bubble matrix    fib    fft
 0.112  0.152  0.092  0.164  0.060
11273919 branch-misses
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.120  0.160  0.088  0.156  0.052
11457651 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-super onebench.fs
 sieve bubble matrix    fib    fft
 0.112  0.152  0.088  0.152  0.056
11107473 branch-misses
gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.096  0.128  0.052  0.100  0.032
8690210 branch-misses
gforth-fast-4.8 --ss-number=0 --ss-states=0 onebench.fs
 sieve bubble matrix    fib    fft
 0.088  0.120  0.052  0.088  0.036
7687416 branch-misses

gforth-fast-4.8-PR15242 is built with gcc-4.8 with one shared
indirect jump (that is jumped to with direct jumps, similar to what
switch-based interpreters usually generate), while gforth-fast-4.8
has a separate indirect jump at the end of each VM instruction
implementation; "--no-dynamic" (the first two runs) uses that
directly.  "--no-super" uses dynamic replication (and gives each
replicated implementation its own indirect branch); and the last two
runs use dynamic superinstructions with replication (also with a
separate indirect branch at the end of each dynamic
superinstruction).  See <2015Aug10.194919@mips.complang.tuwien.ac.at>
for a more detailed description.

The branch mispredictions are indeed not that much different between
the versions (all better than the best on the Ivy Bridge), but there
is still quite a bit of performance difference between the first two
(just from the extra direct jump); dynamic replication does not give
better branch prediction on the Haswell, and consequently no better
performance than gforth-fast-4.8 --no-dynamic.  But superinstructions
with replication do provide a nice speedup again, probably due to
fewer executed branches.
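The dynamic superinstruction construction described above can be
sketched as follows; this is a hypothetical, heavily simplified
version of the idea (all names are made up, error handling is
minimal, and a real engine such as Gforth must first check that each
handler's code is actually relocatable):

/* Hypothetical sketch of dynamic superinstruction construction:
   concatenate copies of the machine code of consecutive VM
   instruction handlers into executable memory, leaving out each
   handler's trailing dispatch code so the copies fall through to one
   another; only the end of the sequence gets a dispatch.  The start
   and length of each handler body would come from label arithmetic
   in the engine function; this only works for position-independent
   handler code, and W^X systems need an mprotect step instead of a
   writable+executable mapping. */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef struct {
    void   *start;     /* first byte of the handler's machine code      */
    size_t  body_len;  /* bytes up to, not including, the dispatch code */
    size_t  disp_len;  /* length of the dispatch code that follows      */
} handler;

/* build one superinstruction for the opcode sequence ops[0..n-1] */
void *make_superinst(const handler *h, const int *ops, int n)
{
    unsigned char *code = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return NULL;
    unsigned char *p = code;
    for (int i = 0; i < n; i++) {
        const handler *x = &h[ops[i]];
        memcpy(p, x->start, x->body_len);   /* body only: falls through */
        p += x->body_len;
        if (i == n - 1)                     /* append one final dispatch */
            memcpy(p, (unsigned char *)x->start + x->body_len, x->disp_len);
    }
    return code;                            /* new code address for the VM */
}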
So, if you conclude from the paper that optimizing the dispatch of
interpreters is no longer relevant, that's not even true on Haswell
(and certainly not on any other CPU).  On most benchmarks the last
version is more than twice as fast as the first one, and the first
one is probably faster than a switch-based interpreter.  But the
paper is correct in that the branch prediction of the Haswell is much
better than that of earlier CPUs.  The difference is most extreme for
gforth-fast-4.8-PR15242 --no-dynamic: 71M mispredictions on Ivy
Bridge, 13M on Haswell (Ivy Bridge results:
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Let's look at more substantial benchmarks, from the Forth appbench
suite (Ivy Bridge results in the same posting
<2015Aug11.090812@mips.complang.tuwien.ac.at>).  Run times in
seconds:

 shared non-shared
  --no-    --no-    --no-
dynamic  dynamic    super  default
  3.332    2.440    2.276    1.468  benchgc
  1.652    1.524    1.064    0.544  brainless
  4.016    3.416    2.288    1.520  brew
  3.420    3.236    2.232    1.232  cd16sim
  2.956    2.684    1.484    0.864  fcp
 13.128    9.848    9.280    7.840  lexex

Again, the interpreter optimizations give speedup factors of
1.67-3.42 from the most switch-like version to the Gforth default,
and there are also good speedups for separate indirect jumps over the
shared indirect jump.  For mispredictions the picture was as follows:

     shared  non-shared
      --no-       --no-       --no-
    dynamic     dynamic       super     default
 24,596,739  17,005,970  12,584,698   9,426,835  benchgc
163,953,064 178,234,917  51,375,412  18,391,920  brainless
281,856,378 279,851,195  47,928,843  21,390,209  brew
268,684,916 297,000,355  70,879,605  22,105,620  cd16sim
255,802,864 293,473,631  33,496,775  17,280,944  fcp
 39,426,063  12,267,065   8,269,104   6,885,712  lexex

We don't see a consistent benefit from the separate indirect jumps
here; replication does help a lot for a number of benchmarks, though,
and dynamic superinstructions with replication even more.  These
differences in branch prediction can explain the differences in run
time between the second and the third column.  So, even with
Haswell's branch predictor, there are significant branch prediction
benefits from replication, resulting in significant speedups.  I'll
leave the explanation of these results to the branch predictor
experts.  The difference from Ivy Bridge
<2015Aug11.090812@mips.complang.tuwien.ac.at> is quite big,
especially for the first lexex column: 2590M mispredictions on Ivy
Bridge, 39M on Haswell.

Finally, the results from my interpreter dispatch microbenchmark
(cf. <2015Aug11.160837@mips.complang.tuwien.ac.at> for Ivy Bridge
results):

subrout direct indirect switch   call repl-sw
   0.00   0.02     0.02   0.14   0.15    0.11

Branch mispredictions:

for i in subroutine direct indirect switch call repl-switch; do perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'; done
      3835 subroutine
      3302 direct
      3863 indirect
      9083 switch
      4419 call
      6018 repl-switch

Branch prediction is close to perfect for all variants (there are
100M indirect branches/calls in each benchmark except subroutine).
There is still a speed difference, probably due to the additional
overhead of switch, call, and repl-switch: direct and indirect use 1
cycle/VM instruction, switch 5.8, call 6.1, and repl-switch 4.4.
It's interesting that, even without the branch prediction advantage
that was the motivation for repl-switch, repl-switch is still quite a
bit faster than switch (at least for this benchmark; for bigger
programs the larger cache load from the replicated switch tables
could become a problem).
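The cycles/VM-instruction figures follow from the run time, the 4GHz
clock, and the 100M dispatches per run; here is a quick
back-of-the-envelope check (the small deviations from the figures
above come from the run times being rounded to 10ms):

/* cycles per VM instruction = seconds * clock rate / dispatches */
#include <stdio.h>

int main(void)
{
    const double hz = 4e9, dispatches = 100e6;  /* 4GHz, 100M dispatches */
    const char  *name[] = { "direct", "indirect", "switch", "call", "repl-sw" };
    const double secs[] = {   0.02,     0.02,      0.14,     0.15,    0.11 };
    for (int i = 0; i < 5; i++)
        printf("%-8s %4.1f cycles/VM instruction\n",
               name[i], secs[i] * hz / dispatches);
    return 0;
}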
My conclusion: Haswell's indirect branch prediction is really much
better than before, but for larger programs running on an interpreter
like Gforth, replication still provides substantial branch prediction
improvements that result in significant speedups.  And while there is
no longer a branch prediction advantage to keeping separate indirect
branches for dispatch, there is still a significant speed advantage;
and dynamic superinstructions also provide a good speedup, resulting
in overall speedups by factors 1.67-3.42 for the application
benchmarks.

Why are the results here different from those in the paper?  1) A
different interpreter.  2) Different benchmarks.

If you write an interpreter and look for performance, should you go
for interpreter optimizations like threaded code, replication, and
dynamic superinstructions, as I suggest, or just use a switch-based
interpreter, as the paper suggests?  Threaded code is a good idea
with little cost in any case.  If that provides a significant speedup
and your VM instruction implementations are short (you run into cache
trouble with long VM instructions [vitale&abdelrahman04]), then
replication with superinstructions will probably give a good speedup.

@InProceedings{vitale&abdelrahman04,
  author   = {Benjamin Vitale and Tarek S. Abdelrahman},
  title    = {Catenation and Specialization for {Tcl} Virtual Machine
              Performance},
  crossref = {ivme04},
  pages    = {42--50},
  year     = {2004}
}

@Proceedings{ivme04,
  booktitle = {IVME '04 Proceedings},
  title     = {IVME '04 Proceedings},
  year      = {2004}
}

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html