  for (i=0; i<250000000; i++) {
    *b = *a+1;
    *d = *c+1;
    *f = *e+1;
    *h = *g+1;
  }

The eight parameters of the binary determine which memory locations the pointers a b c d e f g h refer to, in particular whether some of them refer to the same memory location or not. Each parameter corresponds to one pointer, and two pointers refer to the same memory location iff the corresponding parameters are the same.
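To make this concrete, here is a minimal sketch of a driver in the spirit of main.c; the names buf and memdep are illustrative assumptions, not necessarily the ones in the actual source. It treats each parameter as the index of an 8-byte element in one shared buffer, which is consistent with the cache-line and page arithmetic in the combination descriptions below:

  /* Sketch of a driver: each command-line argument is an element index
     into one shared buffer, so equal arguments yield pointers to the
     same memory location. */
  #include <stdio.h>
  #include <stdlib.h>

  extern void memdep(long *a, long *b, long *c, long *d,
                     long *e, long *f, long *g, long *h);  /* the loop */

  static long buf[1024];    /* shared buffer; all pointers point here */

  int main(int argc, char **argv)
  {
    long *p[8];
    int i;
    if (argc != 9) {
      fprintf(stderr, "usage: %s a b c d e f g h\n", argv[0]);
      exit(1);
    }
    for (i = 0; i < 8; i++)
      p[i] = buf + atol(argv[i+1]);   /* index of an 8-byte element */
    memdep(p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]);
    return 0;
  }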
This microbenchmark uses pointers that have been in their registers for a long time, so it does not test speculative alias detection by the hardware.
I have measured using the following parameter combinations (given in the order a b c d e f g h):
X (0 8 16 24 32 40 48 56): All computations (statements in the loop body) are completely independent. Ideally they can be performed as fast as the resources allow, but all accesses are to the starts of different cache lines. This may reveal some resource limitations.
Y (0 9 18 27 36 45 54 63): Like X, but this time all memory accesses are to different parts of the cache lines.
Z (0 513 994 11 524 989 22 535): Like Y, but this time the memory accesses are far enough apart that they probably hit three different pages in a round-robin fashion.
A (0 1 2 3 4 5 6 7): All computations (statements in the loop body) are completely independent. Ideally they can be performed as fast as the resources allow. All accesses are in the same 64-byte block (possibly in the same cache line).
B (0 1 1 2 2 3 3 4): A sequence of 4 dependent computations, but the next iteration does not depend on the results of the previous one, so again the work can be performed as fast as the resources allow. However, to achieve this performance, the loads of the next iteration have to be started while several architecturally earlier stores still have to be executed; comparing the results of B to those of A, all measured cores have more difficulty with that, resulting in slower performance.
B1 (0 1 1 1 1 1 1 1): Like B, but this time all accesses per iteration except the first are to the same memory location. If the hardware can do something like memory renaming, it can overlap the next iteration with the current one; if not, this should be quite a bit slower than B. Of course, on in-order microarchitectures there is no overlap between iterations, so you won't see such a difference there.
B2 (0 1 1 1 2 1 1 1): Like B1, but instead of a single chain of 4 dependent computations, we now have two chains of two dependent computations each. Both chains occupy the same memory location for a while.
C (0 0 2 2 4 4 6 6): 1-computation recurrences (i.e., the computation in the next iteration depends on a computation in the current iteration), four of them. So at most 4 of these computations (plus the loop overhead) can be performed in parallel.
D (0 1 1 0 2 3 3 2): 2-computation recurrences (i.e., two dependent computations in an iteration, and the first of them in the current iteration depends on the second one in the previous iteration), two of them. So at most two of these computations (plus the loop overhead) can be performed in parallel.
E (0 1 2 3 1 0 3 2): The same data flow as D, but the computations are arranged differently: here we first have two independent computations, and then two computations that depend on the earlier ones.
F (0 1 1 2 2 0 3 3): A 3-computation recurrence and a 1-computation recurrence. In the best case you see the latency of three computations per iteration.
G (0 1 1 2 3 3 2 0): The same data flow as F, but the computations are arranged differently.
H (0 0 0 0 0 0 0 0): One 4-computation recurrence. These computations can only be performed sequentially; only the loop overhead can be performed in parallel to them (see the expansions sketched below).
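To make the dependence patterns concrete, here are the loop bodies of B1 and H with the pointer equalities substituted (p0 and p1 are the pointers built from parameter values 0 and 1):

  /* B1 (0 1 1 1 1 1 1 1): one 4-computation chain per iteration;
     the next iteration restarts the chain with a load of *p0 */
  void body_b1(long *p0, long *p1)
  {
    *p1 = *p0+1;   /* does not depend on the statements below */
    *p1 = *p1+1;   /* each of these loads the value just stored */
    *p1 = *p1+1;
    *p1 = *p1+1;
  }

  /* H (0 0 0 0 0 0 0 0): one 4-computation recurrence; every
     statement, and every iteration, depends on the previous one */
  void body_h(long *p0)
  {
    *p0 = *p0+1;
    *p0 = *p0+1;
    *p0 = *p0+1;
    *p0 = *p0+1;
  }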
All results are in cycles per computation: the measured cycle count is divided by 10^9 (250M iterations times 4 computations). The Alpha was measured only with the A-H combinations:

      A     B     C     D     E     F     G     H  microarchitecture CPU
   1.98 14.42  6.04  6.07  6.07  4.78  4.73  8.02  21264B (Alpha)

The other microarchitectures:

     X    Y    Z    A    B   B1   B2    C    D    E    F    G    H  microarchitecture CPU
  3.01 3.01 3.01 3.01 4.52 4.52 4.01 3.01 4.02 3.01 4.01 4.01 5.02  U74           JH7100 (RISC-V)
  3.01 3.02 3.02 3.01 6.02 6.01 5.27 3.01 5.27 5.01 5.38 5.39 7.01  PPC7447A      iBook G4 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=1066)
  3.25 3.25 3.25 3.25 4.00 4.00 3.75 3.25 3.75 3.25 3.75 3.50 4.00  Cortex-A53    Amlogic S905
  3.25 3.25 3.25 3.25 4.00 4.00 3.75 3.25 3.75 3.25 3.75 3.50 4.00  Cortex-A55    RK3588
  1.25 1.25 1.25 1.25 2.27 5.87 3.71 2.22 3.96 4.00 5.85 5.67 7.69  Cortex-A72    BCM2835
  1.75 1.75 1.75 1.75 3.24 4.25 3.49 1.75 2.74 2.99 3.74 3.74 5.00  Cortex-A73    Amlogic S922 (taskset -c 5 make TIME="/usr/bin/time -f %U 2>&1" MHZ=1800)
  1.72 1.50 1.50 1.50 1.77 1.89 1.16 1.63 2.98 3.02 4.11 4.10 5.51  Cortex-A76    RK3588
  1.06 1.03 1.03 1.04 1.67 2.18 1.44 1.50 3.00 3.00 4.50 4.51 6.00  IceStorm      Apple M1 (variations from 6.0-8.75 for H)
  0.55 0.55 0.55 0.52 0.84 0.84 0.61 1.68 3.32 3.29 5.00 4.94 6.36  FireStorm     Apple M1 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=3228)
  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00  Bonnell       Atom 330
  2.00 2.00 2.05 2.00 3.07 4.25 3.50 2.00 3.52 2.96 5.09 5.46 6.70  Silvermont    Celeron J1900
  1.25 1.25 1.25 1.25 1.82 1.82 1.51 2.01 3.46 3.48 4.79 4.75 6.42  Goldmont      Celeron J3455
  1.25 1.25 1.25 1.25 1.68 1.68 1.50 1.67 3.00 3.00 4.50 4.50 6.00  Goldmont+     Celeron J4105
  1.21 1.00 1.00 1.00 1.53 1.55 1.19 1.49 3.00 3.00 4.50 4.50 6.11  Tremont       Celeron N4500
  1.00 1.00 1.00 0.75 0.75 0.70 0.70 0.75 0.70 1.00 0.75 0.75 1.00  Gracemont     Core i3-1315U (taskset -c 5 make CYCLES=cpu_atom/cycles:u/)
  2.00 1.17 1.17 1.17 2.05 2.84 2.07 1.25 3.30 3.68 4.50 4.50 6.93  K8            Athlon 64 X2 4600+
  1.17 1.17 1.27 1.23 2.47 2.91 2.18 1.20 3.28 3.07 4.63 4.73 6.48  K10           Phenom II X2 560
  1.25 1.25 1.25 1.33 1.76 1.63 1.34 1.64 4.34 4.16 6.24 6.36 8.96  Excavator     Athlon X4 845
  1.00 1.00 1.00 1.00 1.26 1.26 1.04 2.00 4.00 4.00 6.01 6.00 8.00  Zen           Ryzen 1600X
  1.00 1.00 1.00 1.00 1.19 1.19 1.00 2.00 4.00 4.00 6.00 6.00 8.00  Zen2          Ryzen 3900X
  1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00  Zen3          Ryzen 5800X
  2.19 1.38 1.38 1.38 1.78 3.37 2.50 1.38 3.02 3.02 4.55 4.50 6.00  Penryn        Xeon E5450
  1.26 1.24 1.26 1.25 1.72 1.72 1.50 1.46 3.01 3.01 4.51 4.51 6.03  Nehalem       Xeon X3460
  1.06 1.00 1.01 1.00 1.52 1.52 1.25 1.99 3.69 3.68 4.65 4.63 7.45  Sandy Bridge  Xeon E3-1220
  1.00 1.00 1.00 1.00 1.26 1.26 1.05 1.57 3.03 3.00 4.50 4.50 6.05  Haswell       Core i7-4790K
  1.00 1.00 1.00 1.00 1.12 1.21 1.04 1.43 2.90 2.84 4.03 4.05 5.53  Skylake       Core i5-6600K
  1.00 1.00 1.00 0.78 0.94 0.91 0.85 1.01 1.81 1.13 2.00 2.00 2.25  Rocket Lake   Xeon W-1370P
  1.00 1.00 1.00 0.81 0.81 0.81 0.81 0.75 0.81 0.84 0.82 0.86 1.00  Tiger Lake    Core i5-1135G7
  1.00 1.00 1.00 0.55 0.75 0.65 0.65 0.78 0.68 0.67 0.66 0.65 0.65  Golden Cove   Core i3-1315U (taskset -c 2 make CYCLES=cpu_core/cycles:u/)

There can be different sophistication levels of CPU cores, both in terms of dealing with aliases and in terms of forwarding from stores to loads. And of course there is the difference between in-order execution and out-of-order (OoO) execution.
The most basic approach is to perform the memory accesses in architectural order, with a load paying the full latency of any preceding store. On the plus side for Bonnell, each of A-H takes only as long (in cycles) as A does on other in-order cores (e.g., the U74); and looking at the Bonnell pipeline, I really wonder how they do that: given a three-cycle data-cache access followed by a 1-cycle ALU operation, the results require a store-to-load latency of -1 cycles, whether the addresses are the same or not (e.g., 3.00 cycles for H, where the 3-cycle load and 1-cycle add alone would already add up to 4 cycles per computation). Maybe there is a bypass from the store to one of the later load-unit stages if the addresses match (and if they don't, starting the load early does not hurt).
The next approach is to allow architecturally later loads to be performed at the same time as, or, with OoO, even earlier than, stores to a different address. All tested cores seem to do this.
Finally, there is the issue of what to do when the load is to the same address as an architecturally preceding store. This brings us to the next section:
The simplest approach there is to wait until the store data has arrived in the cache and then satisfy the load from the cache. The next approach is to let the load read the data from the store buffer if it is there (first waiting until it is). In that case the whole store-load sequence has a total latency that is not much higher than the load latency. Most cores measured here seem to do that. We see this best in the H results: e.g., Skylake has 4 cycles of load latency and 1 cycle of add latency, and we see 5.53 cycles per computation for store-load-add, meaning that the store contributes an average latency of only 0.53 cycles. There are some complications in case the load only partially overlaps the store (I should put an appropriate chipsncheese link here).
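As a software analogy (a conceptual sketch of the mechanism, not any real core's implementation), store-to-load forwarding amounts to something like this:

  /* On a load, search the store buffer from youngest to oldest for a
     store to the same address; if there is one, wait for its data and
     return that instead of accessing the cache. */
  #include <stddef.h>

  typedef struct {
    unsigned long addr;   /* address of an 8-byte store */
    long data;            /* the value being stored */
    int data_ready;       /* has the store data been computed yet? */
  } sb_entry;

  long load8(sb_entry *sb, size_t n, unsigned long addr, const long *mem)
  {
    size_t i;
    for (i = n; i-- > 0; )              /* youngest store first */
      if (sb[i].addr == addr) {
        while (!sb[i].data_ready)
          ;                             /* stall until the data arrives */
        return sb[i].data;              /* forward from the store buffer */
      }
    return mem[addr / sizeof(long)];    /* no older store: read the cache */
  }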
Finally, the core might detect that the loaded data comes from a store that has not yet been retired; the physical register used by the store can then be directly renamed into the register of the load result, and the load does not need to access the store buffer or cache at all (this feature is called memory renaming; see below). As a result, in the ideal case we see only the 1-cycle latency of the add in case H. Among the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully; Rocket Lake probably also does it, but either does not succeed in all cases (the differing results for D and E speak for this theory), or there is an additional source of latency.
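Again as a conceptual sketch (real implementations are speculative: they must verify the address match at execution time and recover on misprediction), memory renaming at the rename stage amounts to something like this:

  /* Remember which physical register holds the data of the youngest
     store to an address; a later load from that address reuses that
     register, bypassing store buffer and cache entirely. */
  #include <stdbool.h>

  enum { TABLE_SIZE = 64 };

  typedef struct {
    bool valid;
    unsigned long addr;   /* address the store wrote to */
    int preg;             /* physical register holding the stored data */
  } memrename_entry;

  static memrename_entry table[TABLE_SIZE];

  static void rename_store(unsigned long addr, int data_preg)
  {
    memrename_entry *e = &table[addr % TABLE_SIZE];
    e->valid = true; e->addr = addr; e->preg = data_preg;
  }

  /* Returns the physical register to map the load result to, or -1 if
     the load has to get its data from the store buffer or cache. */
  static int rename_load(unsigned long addr)
  {
    memrename_entry *e = &table[addr % TABLE_SIZE];
    return (e->valid && e->addr == addr) ? e->preg : -1;
  }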
I expected that Firestorm (the performance core of the Apple M1) would also have this feature, but the results show that it does not.
B2 is similar to B1, but there are two 2-computation dependency chains instead of the one 4-computation chain.
Going by the H results in the table, Intel added memory renaming between Tremont and Gracemont on the E-core side, and between Skylake and Rocket Lake/Tiger Lake on the P-core side; AMD added it between Zen2 and Zen3. On the ARM side, none of the microarchitectures I measured (including Firestorm) show memory renaming.
On the PPC7447A, B1 has the same speed as B, but both are so close to the H case that it seems more likely that, rather than performing memory renaming, the PPC7447A simply has very limited out-of-order capabilities and does not reorder enough to allow more overlapping for B than for B1.
With OoO, we see much better performance in the cases with independent computation chains. Given the size of the buffers in the various OoO microarchitectures (hundreds of instructions in the reorder buffer, dozens in the schedulers) and the small size of each iteration (~15 instructions), it is surprising that B is slower than A; and even D is slower than E on some microarchitectures, most notably Rocket Lake.
To reproduce these measurements, run:

  wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
  unzip memdep.zip
  cd memdep
  make

If you are working on a machine without a working perf, you can use the following instead of the make in the last line:
  make TIME="/bin/time -f %U 2>&1" MHZ=...

where you have to replace the ... with the actual (not base) clock rate of the processor you are measuring.
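For example, on a machine whose cores run the benchmark at 1500 MHz (a value used here only for illustration; determine it for your own machine):

  make TIME="/bin/time -f %U 2>&1" MHZ=1500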
If you want to try your own parameter combinations, you can run the binary with

  ./memdep-`uname -m` a b c d e f g h

where a b c d e f g h are integers that correspond to the pointers in the loop. If you want to get results like in the table above, run it like this:
  perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
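This is the Makefile recipe, hence the make variables $(CYCLES) and $(ARCH) and the doubled $$. At the shell level, the equivalent for, e.g., the B2 combination on x86-64 looks like this (assuming the plain cycles:u event; on hybrid CPUs use cpu_core/cycles:u/ or cpu_atom/cycles:u/ as in the table notes):

  perf stat --log-fd 3 -x, -e cycles:u ./memdep-x86_64 0 1 1 1 2 1 1 1 3>&1 |
    awk -F, '{printf("%5.2f\n",$1/1000000000)}'

The division by 1000000000 converts the total cycle count into cycles per computation: the loop performs 250M iterations with 4 computations each.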
I have done a microbenchmark that was intended to measure predictive store forwarding, but it (also?) measured the forward-at-register-level technique described above.