The article describes this as follows: There are tables of addresses of functions, and a function pointer is represented by the address of the table entry. A dispatch function (one for each signature) checks whether the "function pointer" is inside the table, and if so, calls the function pointed to by the entry.
Nowadays, in security-critical code (such as code where you would enable control-flow integrity) you use retpolines instead of indirect branches/calls, in order to work around Spectre; retpolines cost 20 cycles or more because they guarantee a branch misprediction.
So I wondered whether, instead of the scheme described above, one would not be better off with a hard-coded binary search among the possible function pointers, followed by direct jumps/calls to the targets. Here a function pointer is represented by an integer 0..n-1, if there are n target functions.
Later I refined this into a more B-tree-like scheme: discern among as many targets (for a leaf node) or subtrees (for an internal tree node) as fit in a cache line, even at the cost of a more linear search within the cache line. This allows us to decide among 8 targets in a leaf node and 8 subtrees in an internal node, at the cost of up to 5 executed branches per node. In the special case of a two-level tree, making use of short branches allows 9 subtrees in the root node, and 9 targets in one of the leaf nodes, for a total of 73 targets.
Paf asked about the cost of this scheme, so I wrote this microbenchmark. There are a number of cost aspects:
By contrast, the table-using code (with retpolines) costs 27 bytes itself, 21 bytes in a different section for the retpoline code (which may be shared among several users), and 8 bytes per target, with 8 targets per cache line (i.e., the table alone is as big as the hard-coded B-tree-like search).
In terms of cache misses, if we assume that the code and tables are no longer in the L1 caches (but the retpoline code is), the binary search for 7 functions will encounter 1 I-cache miss, while the table-using version encounters 1 I-cache and 1 D-cache miss.
If we have more than 7 functions, the binary search will use more cache lines. And if we care about that, we will prefer the B-tree-like variant over the pure binary search. This allows us to call one of 73 targets with two cache accesses (i.e., a comparable cost to the table-using variant). For more targets, we need more I-cache accesses. With 3 I-cache accesses, we can reach at least 512 targets, and with 4, at least 4096 targets.
If you know which functions are called how often, you can rebalance the binary search to minimize the number of average cache misses.
Of course, the slowdown from retpolines of the table-using code in hot code may outweigh the cache-miss cost of the binary search in cold code.
However, the random case is not very realistic, and one will have to measure production code to see how well the predictors behave. For stuff like Linux VFS code, I expect the conditional branch predictor to perform very well if you use tree search for dispatch.
Each iteration of this microbenchmark measures a dispatch, a loop around it, and a little bit of work done by the called function (basically adding a constant to the input value and returning it). It measures the in-cache and predictable-branch performance (except that the retpoline variant throws the predictability away). There are still some things that are not clear under these circumstances. Optimization guides tell me that, e.g., Skylake can process 2 branches per cycle, but only one taken branch per cycle; they also tell me that, if there are too many branches in a 16-byte fetch unit, this will overload the branch predictor. The binary search dispatch is pretty dense in branches, and the B-tree-like search even more so, so I expected this to have an effect.
The dispatch is among 7 different target functions for the binary search (method 1) and among 73 for the B-tree-like search (method 3).
Anyway, here are the results in cycles per iteration (on Sandy Bridge (Xeon E3-1220), Skylake (Core i5-6600K), Zen (Ryzen 7 1800X), Zen2 (Ryzen 9 3900X), Zen3 (Ryzen 7 5800X)), explanations below:
Sandy Skylk   Zen  Zen2  Zen3
 6.01  5.01  5.04  5.01  5.01  dispatch 0 6
 9.01  7.01  8.02  7.03  7.01  dispatch 2 6
32.03 38.15 57.09 45.96 24.01  dispatch-retpol 0 6
36.03 43.11 60.08 49.33 26.01  dispatch-retpol 2 6
 9.01  7.59  6.02  6.01  6.01  dispatch-retpol 1 0
 9.01  7.62  6.02  6.01  6.01  dispatch-retpol 1 1
12.02  8.57  8.02  8.02  8.01  dispatch-retpol 1 2
12.02  8.58  8.02  8.02  8.01  dispatch-retpol 1 3
10.01  6.01  7.03  7.01  7.01  dispatch-retpol 1 4
10.01  6.01  7.02  7.01  7.01  dispatch-retpol 1 5
12.01  7.01  8.02  8.02  8.01  dispatch-retpol 1 6
10.01  8.01  7.01  7.01  7.02  dispatch-retpol 4 0
10.01  8.01  7.01  7.01  7.01  dispatch-retpol 4 1
10.01  8.02  7.02  7.01  7.02  dispatch-retpol 4 2
10.01  8.02  7.01  7.03  7.01  dispatch-retpol 4 3
 8.01  7.49  6.01  6.01  6.01  dispatch-retpol 4 4
 8.01  7.54  6.01  6.01  6.01  dispatch-retpol 4 5
 8.01  7.57  6.01  6.01  6.02  dispatch-retpol 4 6
 8.01  7.55  6.02  6.01  6.01  dispatch-retpol 4 7
11-15   7-9  8-10  8-10  8-10  dispatch-retpol 3 *

"dispatch" is the benchmark compiled with clang-11.0 -Os without retpolines; dispatch-retpol is compiled with clang -Os -mretpoline. The first parameter specifies which dispatcher to use:
The results show that, for 7 target functions, the binary search is competitive with the no-retpoline version of the control-flow-integrity-checked indirect call. The single-level B-tree-like search shows performance similar to, sometimes even better than, the binary search, so the increase in executed branches does not hurt. The two-level B-tree-like variant is typically ~2-3 cycles slower than the single-level one, so another B-tree level is relatively cheap as long as there are no I-cache misses and branch mispredictions.
Once retpolines come into play, the tree-based searches are quite a lot faster. The picture may look less rosy if you have more targets and the cache misses play a role, or if branch mispredictions come into play, but that actually needs measurements (with performance monitoring counters) in actual production code.
The effect of having too many branches in a 16-byte block is not really noticeable, either on Skylake or on the other CPUs.