stores size stride offset

It performs AVX256 (32-byte) stores in a page-aligned region of size size, one store every stride bytes, starting at offset offset from the start of the page. It performs stores until the region ends, then repeats that until the number of stores exceeds 10^9. The native code of the two nested loops is:
┌─→mov     %rdi,%rax
│┌→vmovdqu %ymm0,(%rax)
││ add     %rcx,%rax
││ cmp     %rax,%rsi
│└─jae     2b
│  mov     %rdx,%rax
│  lea     (%rdx,%r8,1),%rdx
│  cmp     $0x3b9ac9ff,%rax
└──jbe     28

The vmovdqu is the store.
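For reference, here is a minimal C sketch of what the benchmark kernel plausibly looks like, reconstructed from the description and the disassembly above; the allocation method, loop structure, and names are my assumptions, and the actual stores.c may differ (compile with, e.g., gcc -O2 -mavx):

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s size stride offset\n", argv[0]);
    exit(1);
  }
  size_t size   = atol(argv[1]);
  size_t stride = atol(argv[2]);
  size_t offset = atol(argv[3]);
  /* page-aligned region, rounded up so stores may spill past size */
  char *page = aligned_alloc(4096, (size / 4096 + 2) * 4096);
  if (page == NULL)
    exit(2);
  __m256i v = _mm256_setzero_si256();
  for (long n = 0; n <= 999999999L; ) {    /* until >10^9 stores */
    /* one 32-byte store every stride bytes until the region ends */
    for (char *p = page + offset; p < page + size; p += stride, n++)
      _mm256_storeu_si256((__m256i *)p, v);
  }
  return 0;
}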
I first present the runs that measure the various features, showing several performance counters on a Haswell (Core i7-4790K); later I show just the cycles per store (plus overhead) for a variety of hardware.
Let's start with the best case: aligned adjacent stores in the L1 cache with high trip counts (511 iterations) for the inner loop:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 16352 32 0
1053801504 cycles
4010346927 instructions
1000085309 L1-dcache-stores
1002064930 branches
1961365 branch-misses

We see 1 cycle/store plus a little overhead. We also see one branch misprediction every time the inner loop is left (10^9 stores / 511 stores per pass ≈ 1.96M loop exits, which matches the branch-miss count). Next, adjacent misaligned stores:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 16353 32 1
1999067480 cycles
4010486779 instructions
1000112211 L1-dcache-stores
1002090902 branches
1961937 branch-misses

We see about 2 cycles/store. With 64-byte cache lines, this mixes two separate cases: 1) stores within a cache line, and 2) stores crossing a cache-line boundary. Let's separate the two by using stride 64 and appropriate offsets. First, let's make sure that the larger stride has no other effects, and start out with aligned accesses:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32672 64 0
1053832353 cycles
4010368680 instructions
1000089056 L1-dcache-stores
1002068573 branches
1961410 branch-misses

Ok, as fast as adjacent aligned stores. Now, with misalignment inside the cache line:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32673 64 1
1052772470 cycles
4010350832 instructions
1000087770 L1-dcache-stores
1002066544 branches
1961330 branch-misses

That's as fast as an aligned access. Now, crossing the cache line:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32705 64 33
3001525156 cycles
4010659558 instructions
1000143975 L1-dcache-stores
1002122497 branches
1962163 branch-misses

So, a misaligned store into two cache lines costs 3 cycles. The adjacent misaligned case is pretty much the average of the within-line and cross-line cases: with stride 32 and offset 1, every second store crosses a line boundary, so (1+3)/2 = 2 cycles/store.
If we want to use aligned stores, one way to deal with the bytes at the start and end of the region is to cover them with unaligned stores that overlap the aligned stores in between. Does overlapping have an extra penalty?
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 15842 31 0
1967737947 cycles
4010491704 instructions
1000112851 L1-dcache-stores
1002092226 branches
1961942 branch-misses

On the Haswell the penalty seems to come from the unaligned accesses in this variant (stride 31 produces mostly unaligned stores), with no additional penalty for the overlaps.
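To make the overlapping-store technique concrete, here is a minimal sketch of a memset-like fill using it; the function and its interface are my invention, not taken from the benchmark, and it assumes n >= 32 and that all bytes of v are equal (as in memset, so overlapping shifted stores are harmless):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Fill n (>= 32) bytes at p: one unaligned store at each edge,
   aligned stores in between; the aligned stores may overlap the
   edge stores (and the edge stores may overlap each other). */
void fill32(char *p, size_t n, __m256i v)
{
  _mm256_storeu_si256((__m256i *)p, v);            /* unaligned head */
  _mm256_storeu_si256((__m256i *)(p + n - 32), v); /* unaligned tail */
  char *a = (char *)(((uintptr_t)p + 31) & ~(uintptr_t)31);
  for (; a + 32 <= p + n; a += 32)                 /* aligned middle */
    _mm256_store_si256((__m256i *)a, v);
}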
We also want to isolate the effect of crossing page boundaries. If we want to stay within the L1 cache, then given the 32KB L1 cache and 4KB pages, we must store to only 7 page-crossing locations: 7 such stores touch 8 pages (32KB), and an 8th location would already access a 9th page. To establish the baseline, we start out with adjacent aligned stores:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 224 32 0
1188937940 cycles
4714859688 instructions
1000088558 L1-dcache-stores
1142967886 branches
60805 branch-misses

The overhead is bigger than for the high-trip-count variant; it can be explained mostly by the fact that the Haswell can process only one taken branch per cycle: each pass executes 8 branches per 7 stores (7 inner-loop branches plus one outer-loop branch, matching the ~1.14*10^9 branches counted), and most of them are taken. The branch misses are lower, because a 7-iteration loop is short enough for the branch predictor to capture it (history lengths are usually >15 or so branches).
Unaligned cases with stride 32 and 64 pretty much follow the patterns of their high-trip-count counterparts on the Haswell; I don't show them individually here, only in the table below (in case there is something unusual on other CPUs). But here's the page-crossing case that interests us:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 28698 4096 4090
25005531353 cycles
4718613195 instructions
1000776116 L1-dcache-stores
1143657621 branches
8042 branch-misses

An unaligned page-crossing store costs 25 cycles on the Haswell.
Cycles per store (plus overhead):

Sandy  Ivy    Has-   Sky-   Gold.  Grace  Exca-
Bridge Bridge well   lake   Cove   mont   vator  Zen    Zen2   Zen3   Zen4
  2.04   2.04   1.05   1.05   1.07   1.29   3.76   2.00   1.04   1.03   1.04 aligned adjacent:      stores 16352 32 0
  4.18   4.19   2.00   1.63   1.63   2.29   5.07   4.13   2.13   1.21   1.20 unaligned adjacent:    stores 16353 32 1
  2.04   2.04   1.05   1.05   1.07   3.13   5.15   2.02   1.04   1.03   1.04 aligned:               stores 32672 64 0
  2.04   2.05   1.05   1.05   1.07   3.23   6.06   4.05   2.03   2.00   2.00 unaligned within-line: stores 32673 64 1
  6.70   6.73   3.00   2.30   2.31   4.26   6.45   4.32   2.34   2.35   2.45 unaligned cross-line:  stores 32705 64 33
  4.06   4.07   1.97   1.62   1.58   2.20   4.64   4.01   2.10   1.53   1.23 overlapping:           stores 15842 31 0
  2.12   2.13   1.24   1.00   0.68   1.97   2.00   2.00   1.47   1.09   1.12 aligned adjacent:      stores 224 32 0
  3.15   3.16   2.14   1.57   1.58   2.14   4.00   4.00   2.00   1.61   1.15 unaligned adjacent:    stores 257 32 33
  2.13   2.13   1.27   1.00   1.02   1.91   2.00   2.00   1.41   1.13   1.12 aligned:               stores 416 64 0
  2.12   2.13   1.03   1.01   1.01   1.96   4.00   4.00   2.00   2.00   2.00 unaligned within-line: stores 417 64 1
  4.00   4.01   3.00   2.00   2.00   3.00   4.00   4.00   2.00   2.00   2.00 unaligned cross-line:  stores 449 64 33
199.41 200.95  25.01  24.02  24.01  35.27  44.04  26.02  24.02  27.01  34.03 unaligned cross-page:  stores 28698 4096 4090

The CPUs are:
Sandy Bridge: Xeon E3-1220
Ivy Bridge:   Core i3-3227U
Haswell:      Core i7-4790K
Skylake:      Core i5-6600K
Golden Cove:  Core i3-1315U (P-cores)
Gracemont:    Core i3-1315U (E-cores)
Excavator:    Athlon X4 845
Zen:          Ryzen 5 1600X
Zen2:         Ryzen 9 3900X
Zen3:         Ryzen 7 5800X
Zen4:         Ryzen 7 8700G

We see that Sandy Bridge, Ivy Bridge, Excavator, and Zen implement an AVX256 store as two 128-bit stores, each taking one cycle. Haswell, Skylake, Golden Cove, Zen2, Zen3, and Zen4 have 256-bit wide stores and sustain about one store per cycle.
The Intel CPUs have no penalty for unaligned within-line stores in this microbenchmark, while the AMD CPUs do.
All CPUs have a penalty for unaligned cross-cache-line stores. All CPUs have a high penalty for unaligned cross-page stores.
The Excavator shows a penalty for the bigger sizes, even though they should still fit in the D-cache. I have no explanation for that.
The Haswell and Zen2 show a penalty in the aligned low-trip-count case compared to the high-trip-count case (with varying results on different runs). Explanation: these CPUs store so fast that the extra overhead of the outer loop does not vanish in the shadow of the store limit. Then again, the Skylake stores just as fast yet remains store-limited even here; it has enough other resources (it can process up to 6 instructions per cycle under favourable circumstances; here we see an average IPC of 4.7).
I have some numbers for unaligned load performance on mostly older hardware, but they include the Sandy Bridge and Ivy Bridge: the page-crossing penalties for loads on these CPUs are much lower than for stores (28-32 cycles compared to 200 cycles).
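For comparison, a load kernel analogous to the store loop might look like the following sketch (my construction, compiled with -mavx2; not necessarily the code behind the cited load numbers):

#include <immintrin.h>
#include <stddef.h>

/* OR the loaded values into an accumulator so the compiler cannot
   delete the loads; the loads stay independent of each other, only
   the ORs form a one-per-cycle dependency chain. */
__m256i loads(const char *start, const char *end, size_t stride)
{
  __m256i acc = _mm256_setzero_si256();
  for (const char *p = start; p < end; p += stride)
    acc = _mm256_or_si256(acc, _mm256_loadu_si256((const __m256i *)p));
  return acc;
}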