Cost of unaligned stores

How expensive are unaligned stores? To determine that, I have written the microbenchmark below. You call it with
  stores size stride offset
It performs AVX256 (32-byte) stores in a page-aligned region of size size, one store every stride bytes, starting at offset offset from the start of the page. It performs stores until the region ends, then repeats that until the number of stores exceeds 10^9. The native code for the two nested loops is:
┌─→mov     %rdi,%rax
│┌→vmovdqu %ymm0,(%rax)
││ add     %rcx,%rax
││ cmp     %rax,%rsi
│└─jae     2b
│  mov     %rdx,%rax
│  lea     (%rdx,%r8,1),%rdx
│  cmp     $0x3b9ac9ff,%rax
└──jbe     28
The vmovdqu is the store; %rcx holds the stride, %rsi the limit for the store address, %r8 the number of stores per pass over the region, and %rdx the total store count (0x3b9ac9ff = 999999999).
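
As a rough sketch in C, such a benchmark can look like the following (my reconstruction from the description and the assembly above, not necessarily the actual stores.c; the intrinsics used and the per-pass store counting are assumptions):

/* sketch: 32-byte stores every stride bytes in a page-aligned region,
   repeated until more than 10^9 stores have been performed;
   compile with e.g. gcc -O2 -mavx */
#include <immintrin.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  size_t size, stride, offset;
  long count;
  char *buf, *p;
  __m256i v = _mm256_setzero_si256();

  if (argc < 4)
    return 1;
  size = atol(argv[1]); stride = atol(argv[2]); offset = atol(argv[3]);
  if (posix_memalign((void **)&buf, 4096, size) != 0)
    return 1;
  for (count = 0; count <= 999999999L; ) {
    for (p = buf + offset; p + 32 <= buf + size; p += stride)
      _mm256_storeu_si256((__m256i *)p, v);      /* the vmovdqu */
    count += (size - offset - 32) / stride + 1;  /* stores per pass */
  }
  return 0;
}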

I first present the various runs to measure different features by showing several performance counters on a Haswell (Core i7-4790K); later I show just the cycles/store (plus overhead) for various hardware.

Let's start with the best case: aligned adjacent stores in the L1 cache with high trip counts (511 iterations) for the inner loop:

perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 16352 32 0
1053801504  cycles
4010346927  instructions
1000085309  L1-dcache-stores
1002064930  branches
1961365  branch-misses
We see 1 cycle/store plus a little overhead (1053801504 cycles for 10^9 stores). We also see one branch misprediction every time the inner loop is left (10^9 stores / 511 stores per pass ≈ 1.96M loop exits, matching the 1.96M branch misses). Next, adjacent misaligned stores:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 16353 32 1
1999067480  cycles
4010486779  instructions
1000112211  L1-dcache-stores
1002090902  branches
1961937  branch-misses
We see about 2 cycles/store. With 64-byte cache lines, this mixes two separate cases: 1) stores within a cache line, and 2) stores crossing a cache-line boundary. Let's separate them by using stride 64 and appropriate offsets. First, let's ensure that the larger stride does not have other effects, and start out with aligned accesses:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32672 64 0
1053832353  cycles
4010368680  instructions
1000089056  L1-dcache-stores
1002068573  branches
1961410  branch-misses
Ok, as fast as adjacent aligned stores. Now, with misalignment inside the cache line:
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32673 64 1
1052772470  cycles
4010350832  instructions
1000087770  L1-dcache-stores
1002066544  branches
1961330  branch-misses
That's as fast as an aligned access. Now, crossing the cache line (with offset 33, each 32-byte store extends one byte into the next cache line):
perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 32705 64 33
3001525156  cycles
4010659558  instructions
1000143975  L1-dcache-stores
1002122497  branches
1962163  branch-misses
So, a misaligned store into two cache lines costs 3 cycles. The adjacent misaligned case is pretty much the average of the within-cache-line and the across-cache-lines cases: with stride 32 and offset 1, every second store crosses a line boundary, so we expect (1+3)/2 = 2 cycles/store.
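
As a small aid for picking such offsets, here is a little helper (my own illustration, not part of the benchmark) that checks whether a 32-byte store starting at a given offset crosses a boundary of a given size:

#include <stdio.h>

/* does a 32-byte store starting at off cross a boundary of size boundary? */
static int crosses(unsigned long off, unsigned long boundary)
{
  return off / boundary != (off + 31) / boundary;
}

int main(void)
{
  printf("%d\n", crosses(1, 64));      /* 0: offset 1 stays within one line */
  printf("%d\n", crosses(33, 64));     /* 1: offset 33 spans two lines */
  printf("%d\n", crosses(4090, 4096)); /* 1: offset 4090 spans two pages */
  return 0;
}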

If we want to use aligned stores, one way to deal with the bytes at the start and end of the region is to cover them with unaligned stores that overlap the aligned stores in between. Does overlapping have an extra penalty? To test that, the following run uses stride 31, so consecutive 32-byte stores overlap by one byte:

perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 15842 31 0
1967737947  cycles
4010491704  instructions
1000112851  L1-dcache-stores
1002092226  branches
1961942  branch-misses
On the Haswell the penalty seems to come from the unaligned accesses in this variant, without additional penalty for overlaps.
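
For illustration, here is a sketch of that overlapping-store technique for filling a byte range of at least 32 bytes (my own example using AVX intrinsics, not code from this page): an unaligned store for the head, aligned stores for the body, and an unaligned store for the tail that may overlap its predecessors:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* fill n (>= 32) bytes at p with the 32-byte pattern v */
void fill32(char *p, size_t n, __m256i v)
{
  char *end = p + n;
  /* first 32-byte-aligned address strictly after p */
  char *q = (char *)(((uintptr_t)p | 31) + 1);

  _mm256_storeu_si256((__m256i *)p, v);           /* unaligned head store */
  for (; q + 32 <= end; q += 32)
    _mm256_store_si256((__m256i *)q, v);          /* aligned body stores */
  _mm256_storeu_si256((__m256i *)(end - 32), v);  /* unaligned tail store (may overlap) */
}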

We also want to isolate the effect of crossing the page boundaries. If we want to stay within the 32KB L1 cache with 4KB pages, we can store to at most 7 page-crossing locations: 7 crossing stores touch 8 pages (32KB), and an 8th location would already access a 9th page. To establish the baseline, we start out with adjacent aligned stores:

perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 224 32 0
1188937940  cycles
4714859688  instructions
1000088558  L1-dcache-stores
1142967886  branches
60805  branch-misses
The overhead is bigger than for the higher-trip-count variant; it can be explained mostly by the fact that the Haswell can only process one taken branch per cycle. The branch misses are lower, because a 7-iteration loop is short enough that the branch predictor captures it (history lengths are usually >15 or so branches).

Unaligned cases with stride 32 and 64 pretty much follow the patterns of their higher-trip-count counterparts on the Haswell; I don't show them here, only in the table below (in case there is something unusual on other CPUs). But here's the page-crossing case that interests us (with offset 4090 and stride 4096, every store starts 6 bytes before a page boundary and thus spans two pages):

perf stat -x' ' -e cycles -e instructions -e L1-dcache-stores -e branches -e branch-misses stores 28698 4096 4090
25005531353  cycles
4718613195  instructions
1000776116  L1-dcache-stores
1143657621  branches
8042  branch-misses
An unaligned page-crossing store costs 25 cycles on the Haswell.

Various hardware

This table shows the cycles/store (plus overhead, i.e., the total cycles of a run divided by the 10^9 stores) on various hardware:
 Sandy   Ivy   Has-  Sky- Exca-
Bridge Bridge  well  lake vator  Zen   Zen2
  2.04   2.04  1.05  1.05  3.76  2.00  1.04 aligned adjacent: stores 16352 32 0
  4.18   4.19  2.00  1.63  5.07  4.13  2.13 unaligned adjacent: stores 16353 32 1
  2.04   2.04  1.05  1.05  5.15  2.02  1.04 aligned: stores 32672 64 0
  2.04   2.05  1.05  1.05  6.06  4.05  2.03 unaligned within-line: stores 32673 64 1
  6.70   6.73  3.00  2.30  6.45  4.32  2.34 unaligned cross-line: stores 32705 64 33
  4.06   4.07  1.97  1.62  4.64  4.01  2.10 overlapping: stores 15842 31 0
  2.12   2.13  1.24  1.00  2.00  2.00  1.47 aligned adjacent: stores 224 32 0
  3.15   3.16  2.14  1.57  4.00  4.00  2.00 unaligned adjacent: stores 257 32 33
  2.13   2.13  1.27  1.00  2.00  2.00  1.41 aligned: stores 416 64 0
  2.12   2.13  1.03  1.01  4.00  4.00  2.00 unaligned within-line: stores 417 64 1
  4.00   4.01  3.00  2.00  4.00  4.00  2.00 unaligned cross-line: stores 449 64 33
199.41 200.95 25.01 24.02 44.04 26.02 24.02 unaligned cross-page: stores 28698 4096 4090
The CPUs are:
  Sandy Bridge: Xeon E3-1220
  Ivy Bridge: Core i3-3227U
  Haswell: Core i7-4790K
  Skylake: Core i5-6600K
  Excavator: Athlon X4 845
  Zen: Ryzen 5 1600X
  Zen2: Ryzen 9 3900X
We see that Sandy Bridge, Ivy Bridge, Excavator, and Zen implement the AVX256 store as two 128-bit stores, each taking one cycle. Haswell, Skylake, and Zen2 have 256-bit wide stores.

The Intel CPUs have no penalty for unaligned within-line stores in this microbenchmark; the AMD CPUs do.

All CPUs have a penalty for unaligned cross-cache-line stores. All CPUs have a high penalty for unaligned cross-page stores.

The Excavator shows a penalty for the larger regions (the high-trip-count runs), even though they should still fit in the D-cache. I have no explanation for that.

The Haswell and Zen2 show a penalty for the aligned low-trip-count case compared to the high-trip-count case (with varying results on different runs). Explanation: these CPUs store so fast that the extra overhead of performing the outer loop does not vanish in the shadow of the store limit. Then again, the Skylake stores just as fast and is still limited only by the stores; it has enough other resources (it can process up to 6 instructions per cycle under favourable circumstances; here we see an average IPC of 4.7).

I have some numbers for unaligned load performance, mostly on older hardware, but including the Sandy Bridge and Ivy Bridge: the page-crossing penalties for loads on these CPUs are much lower than for stores (28-32 cycles compared to 200 cycles).


Anton Ertl
The directory containing this page also has the source files: stores.c, main.c, and a Makefile.
