Auto-Vectorization and store-to-load forwarding

I have seen huge slowdowns from auto-vectorization in the bubble-sort benchmark of John Hennessy's small integer benchmarks: there is a partial overlap between the double-int store of one iteration and the double-int load of the next, and the usual hardware store-to-load forwarding optimizations do not work in such cases; instead, the CPU takes a slow path.

This kind of glass chin may not show up in the usual benchmarking (because it may not occur in the few hot loops of the benchmarks), but it may slow down other codes significantly (as demonstrated by the bubble-sort case).

Apparently the GCC maintainers noticed this themselves and reduced the aggressiveness of auto-vectorization between gcc-12 and gcc-14, but some remains. I wanted to see whether there are still cases where auto-vectorization leads to slowdowns, so I wrote a microbenchmark where the central loop is:

  for (i=0; i<n; i+=4) {
    long wl0, wl1;
    ns[i] = nl[i]+1;
    wl0 = wl[i];
    wl1 = wl[i+1];
    ws[i] = wl0;
    ws[i+1] = wl1;
  }
Here "n" stands for narrow (one long), "w" for wide (two longs), "l" for load, and "s" for store (e.g., "ns" stands for "narrow store"). gcc-14 compiles the body of the loop as follows:
gcc-14 -O                   gcc-14 -O3
loophead1:                  loophead2:
  mov (%rcx,%rax,8),%rdi      mov    (%rcx,%rax,8),%r10
  add $0x1,%rdi               lea    0x1(%r10),%r9
  mov %rdi,(%r10,%rax,8)      mov    %r9,(%rdi,%rax,8)
  mov (%rsi,%rax,8),%r9       movdqu (%rsi,%rax,8),%xmm0
  mov 0x8(%rsi,%rax,8),%rdi
  mov %r9,(%rdx,%rax,8)       movups %xmm0,(%rdx,%rax,8)
  mov %rdi,0x8(%rdx,%rax,8)
  add $0x4,%rax               add    $0x4,%rax
  cmp %r8,%rax                cmp    %r8,%rax
  jb  loophead1               jb     loophead2
In the -O3 version the two longs wl[i] and wl[i+1] are loaded with one movdqu instruction, and then are stored to ws[i] and ws[i+1] with one movups instruction, whereas the -O code loads and stores each long separately, resulting in a total of four mov instructions for these accesses. The remainder of the code does not differ much between the optimization levels.

The surrounding code sets up nl, ns, wl, and ws such that various memory dependencies are possible between the loads and the stores. Advancing i by 4 in each iteration also makes it possible to set up cases where no memory dependencies occur.

The loop shown above is exercised by the binaries stwlf-*, which take four parameters: the offsets (in longs) of ns, wl, ws, and nl from a common base address (the ordering of the parameters puts nl last for historical reasons). E.g.,

  ./stwlf-x86_64-gcc-14-O3 2 2 4 0
means that ns and wl point to the same place, ws points 2 longs later, and nl points 2 longs earlier (4 longs before ws). Since i advances by 4, ws[i] stores to the same place in one iteration that nl[i] loads from in the next iteration. In this way there is a dependence chain of stores and loads across iterations (two loads and two stores per iteration).

You can benchmark a whole group of parameter sets on stwlf-x86_64-gcc-14-O and stwlf-x86_64-gcc-14-O3 with:

  make
or benchmark the same group of parameter sets on a specific version with
  make bench CC=gcc-14 OPT=-O3
which will also build the needed binary if it does not exist yet. So, to download and run the existing binaries on a specific machine, you do:
  wget -O - https://www.complang.tuwien.ac.at/anton/stwlf/stwlf.tar.xz | tar xfJ -
  cd stwlf
  make
The output contains numbers giving the cycles per iteration of the loop. It also contains a description of the data flow of each iteration. The dependencies nl>ns and wl>ws through registers are built into the binary and cannot be changed by parameter settings, but there can be memory dependencies between any store and any load, depending on how the parameters are set. These dependencies are shown as:
  =>   full width and fully overlapping
  =_=> mixed narrow/wide access where the low-address long of the wide access is accessed by the narrow access.
  =^=> mixed narrow/wide access where the high-address long of the wide access is accessed by the narrow access.
  _=^> partially overlapping wide store to wide load dependency, with the low long of the store becoming the high long of the load (i.e., the store address is one long higher than the load address).
A "recurrence" means that there is a dependence chain within one iteration that continues with a loop-carried dependency to the start of the chain in the next iteration, resulting in a dependence chain throughout all iterations of the loop.

The dependencies are my guess at what causes these numbers. There may also be other microarchitectural effects at work that depend on the actual parameter set and the hardware, and you may want to look at the actual parameter sets (in the Makefile) for deeper insight.

The numbers come from one run (100_000 repetitions of the loop with 1000 iterations), and at least on Zen 3 I have seen clustered variations for certain results (e.g., a given parameter set is sometimes measured as costing 3.16-3.18 cycles/iteration, and sometimes as 3.00 cycles/iteration), but in the big scheme of things these variations are minor.

Here are some numbers for a few recent microarchitectures:

                    AMD                      ||           Intel P-cores          ||    Intel E-Cores     |
  Zen 4   |   Zen 3  |    Zen 2  |     Zen   ||Golden Cove|Rocket Lake|  Skylake ||Gracemont |   Tremont |
 -O   -O3 | -O   -O3 |  -O   -O3 |  -O   -O3 ||  -O   -O3 |  -O   -O3 | -O   -O3 || -O   -O3 |  -O   -O3 |  dependencies
2.14  1.41|2.09  1.52| 3.02  2.01| 3.02  2.02|| 1.65  1.27| 2.43  1.90|3.10  2.03||2.04  1.63| 3.09  2.04|nl>ns wl>ws (maximally independent)
3.00 21.00|3.17 20.01| 3.02 18.01| 3.05 16.01|| 2.01  1.57| 2.84 20.00|3.07 16.01||2.04  6.18| 3.58 14.51|nl>ns=_=>wl>ws
3.09 21.00|3.17 20.01| 3.01 18.01| 3.03 16.01|| 1.94 19.01| 2.81 20.00|3.10 16.01||2.03 15.51| 3.56 14.51|nl>ns=^=>wl>ws
3.00 29.00|3.18 28.00|15.03 22.49|14.84 22.59||10.84 22.49|12.50 25.00|9.73 21.00||2.04 19.00|10.97 17.51|nl>ns=_=>wl>ws=>nl (recurrence)
3.09  8.80|3.18  8.87| 6.95  7.86| 7.02  7.95|| 5.08  7.16| 5.12  5.04|4.39  4.93||2.04  6.90| 5.00  5.97|wl>ws=>wl (recurrence), nl>ns (no recurrence)
3.08  2.00|3.16  2.10| 8.06  7.86| 8.14  7.88|| 7.27  7.22| 7.14  7.23|4.46  5.60||2.44  1.63| 6.00  5.98|nl>ns=>nl (recurrence), wl>ws (no recurrence)
3.07 21.00|3.16 20.01| 8.75 18.12| 8.81 16.01|| 6.98  6.50| 7.18 20.00|5.73 16.02||7.91 10.08| 7.00 14.51|nl>ns=>nl,wl>ws
3.07 29.00|3.18 28.00|15.03 22.49|14.84 22.58||10.84 22.51|12.50 25.00|9.73 21.00||2.04 19.00|10.98 17.51|ws=_=>nl>ns=_=>wl
3.09  2.00|3.17  2.11| 3.03  2.19| 3.13  2.25|| 2.01  1.26| 2.85  1.99|3.08  2.03||2.04  1.72| 3.59  2.12|wl>ws=_=>nl>ns
3.09 22.33|3.18 21.12| 3.26 24.09| 3.31 22.11|| 2.03 19.63| 3.39 20.64|3.15 16.64||2.17 16.02| 4.06 15.06|wl>ws_=^>wl (recurrence) nl>ns (no recurrence)
3.00 29.00|3.17 28.00|10.89 24.00|10.92 22.57|| 8.01 25.97|11.85 25.00|7.72 21.00||2.97 19.00| 8.01 17.50|ws=_=>nl>ns=^=>wl>ws (recurrence), also ws_=^>wl
If there are no memory dependencies, the -O3 binary (i.e., the vectorized wide accesses for the two adjacent longs) outperforms the -O binary (two scalar accesses) on every one of these microarchitectures. Note, however, that the benchmark is more store-heavy than typical code (30% of the -O instructions are stores), so the 1.5 times higher number of stores in the -O code may influence performance more than in most other code, just as the slowdowns discussed below are more prominent in this code than in most other code.

As soon as the wide load depends on any memory written by any close-by instruction (even a fully overlapping wide store), -O outperforms -O3 in most cases, on all microarchitectures; when the memory dependence is from a narrow store or a partially overlapping wide store to the wide load, the slowdown is huge. A dependence from a wide store to a narrow load seems to incur no additional cost and is almost as cheap as no dependence.

The narrow-to-wide slowdown occurs even in those cases where the dependency chain does not include a recurrence. Theories that might explain this: 1) The load/store unit is blocked by whatever happens in the slow path for 10-20 cycles (depending on the microarchitecture). 2) Or the situation results in a microcode assist.

The only other case where -O3 outperforms -O is when the only memory dependency is a wide-store to narrow-load dependency.

One interesting aspect is that, while the cores tend to cope with the memory dependencies better in more recent microarchitectures (the leftmost of each family is the most recent), the penalties for memory dependencies involving wide loads (in -O3) have a tendency to increase in more recent generations.

Recommendations for compiler writers

In the case of the bubble-sort benchmark (c-manual/bubble-sort.c in bench.zip), on a Rocket Lake (Xeon E-2388G) the auto-vectorization leads to a slowdown by a factor of 5.7 of gcc-14.2 -O3 over gcc-14.2 -O3 -fno-tree-vectorize, by a factor of 5.6 over clang-19.1 -O3 (where clang does not vectorize anything in this benchmark), by a factor of 5.3 over gcc-14.2 -O, and by a factor of 2.1 over gcc-14.2 -O0.

It is unclear in how many other cases the slow path becomes a performance problem, but reliably fast code looks more desirable to me than code that is often slightly faster, but sometimes much slower.

So I think that compilers should avoid combining adjacent loads into wide loads unless they have good evidence that there are no recent stores to the same location.

Recommendations for programmers

One way to avoid this pitfall of gcc may be to use clang. Another way would be to disable auto-vectorization with -fno-tree-vectorize, but this also disables vectorization in cases where such slowdowns cannot happen (e.g., in a loop that only reads from memory).

A way to avoid slowdowns in a specific case would be to put an optimization hurdle (such as volatile or an empty asm statement that declares that it modifies memory) between the loads that should not be combined into a wide load. But how do you learn that you have such a performance problem? Maybe with perf record followed by perf annotate; for bubble-sort compiled with gcc-14.2 -O3, it attributes 65% of the cycles to the wide load.

Other architectures

ARM A64 has wide SIMD loads like those gcc uses for AMD64 here. It also has load-pair instructions, and my guess is that on the memory side a load-pair performs a wide load. I expect that store-single to load-pair memory dependencies are much more common on ARM A64 than narrow-to-wide (store-GPR to load-XMM) dependencies on AMD64. It will be interesting to measure how expensive such dependencies are on microarchitectures that implement the ARM A64 architecture, and whether store-64-bit-GPR to load-128-bit-SIMD dependencies are more expensive. However, I will leave that to another day (or to the interested reader).
Anton Ertl