Auto-vectorization and store-to-load forwarding

I have seen huge slowdowns from auto-vectorization in the bubble-sort benchmark of John Hennessy's small integer benchmarks, because there is a partial overlap between the double-int store of one iteration and the double-int load of the next; the usual store-to-load forwarding optimizations do not work for such accesses, and CPUs take a slow path instead.

This kind of glass chin may not show up in the usual benchmarking (because it may not occur in the few hot loops of the benchmarks), but it can slow down other code significantly (as demonstrated by the bubble-sort case).

Apparently the GCC maintainers noticed this themselves and reduced the aggressiveness of auto-vectorization between gcc-12 and gcc-14, but some remains. I wanted to see whether there still are cases where auto-vectorization leads to slowdowns, so I wrote a program whose central loop is:

  for (i=0; i<n; i+=4) {
    long wl0, wl1;
    ns[i] = nl[i]+1;
    wl0 = wl[i];
    wl1 = wl[i+1];
    ws[i] = wl0;
    ws[i+1] = wl1;
  }
Here "n" stands for narrow, "w" for wide, "l" for load, and "s" for store (e.g., "ns" stands for "narrow store"). gcc-14 compiles the body of the loop as follows:
gcc-14 -O                   gcc-14 -O3
loophead1:                  loophead2:
  mov (%rcx,%rax,8),%rdi      mov    (%rcx,%rax,8),%r10
  add $0x1,%rdi               lea    0x1(%r10),%r9
  mov %rdi,(%r10,%rax,8)      mov    %r9,(%rdi,%rax,8)
  mov (%rsi,%rax,8),%r9       movdqu (%rsi,%rax,8),%xmm0
  mov 0x8(%rsi,%rax,8),%rdi
  mov %r9,(%rdx,%rax,8)       movups %xmm0,(%rdx,%rax,8)
  mov %rdi,0x8(%rdx,%rax,8)
  add $0x4,%rax               add    $0x4,%rax
  cmp %r8,%rax                cmp    %r8,%rax
  jb  loophead1               jb     loophead2
In the -O3 version the two longs wl[i] and wl[i+1] are loaded with a single movdqu instruction and stored to ws[i] and ws[i+1] with a single movups instruction, whereas the -O code loads and stores each long separately, resulting in a total of four mov instructions for these accesses. The rest of the code is similar.

The surrounding code sets up nl, ns, wl, and ws such that various memory dependencies are possible between the loads and the stores. Advancing i by 4 in each iteration also makes it possible to set up cases where no memory dependences happen.

The loop shown above is exercised by the binaries stwlf-*, which take four parameters: the offsets (in longs) of ns, wl, ws, and nl from a common base address (the ordering of the parameters puts nl last for historical reasons). E.g.,

  ./stwlf-x86_64-gcc-14-O3 2 2 4 0
means that ns and wl point to the same place, ws points 2 longs later, and nl points 2 longs earlier (4 longs before ws). Since i advances by 4, ws[i] stores to the same place in one iteration that nl[i] loads from in the next iteration. In this way there is a dependence chain of stores and loads across iterations (two loads and two stores per iteration).

You can benchmark a whole group of parameter sets on stwlf-x86_64-gcc-14-O and stwlf-x86_64-gcc-14-O3 with:

  make
or benchmark the same group of parameter sets on a specific version with
  make bench CC=gcc-14 OPT=-O3
which will also build the needed binary if it does not exist yet. So to download and run the existing binaries on a specific machine, do:
  wget https://www.complang.tuwien.ac.at/anton/stwlf/stwlf.zip
  unzip stwlf.zip
  cd stwlf
  make
The output contains numbers which give the number of cycles per iteration of the loop. It also contains a description of the data flow of each iteration. The dependencies nl>ns and wl>ws through registers are built into the binary and cannot be changed by parameter settings, but there can be memory dependencies between any store and any load, depending on how the parameters are set. These dependencies are shown as:
  =>   dependency between accesses of the same kind (narrow/narrow or wide/wide) at the same address.
  =_=> mixed narrow/wide access where the low-address long of the wide access is accessed by the narrow access.
  =^=> mixed narrow/wide access where the high-address long of the wide access is accessed by the narrow access.
  _=^> partially overlapping wide store to wide load dependency, with the low long of the store becoming the high long of the load (i.e., the store address is one long higher than the load address).
A "recurrence" means that there is a dependence chain within one iteration that continues with a loop-carried dependency to the start of the chain in the next iteration, resulting in a dependence chain throughout all iterations of the loop.

The dependencies are my guess at what causes these numbers. There may also be other microarchitectural effects at work that depend on the actual parameter set and the hardware, and you may want to look at the actual parameter sets for deeper insight.

The numbers come from one run (100_000 repetitions of the loop with 1000 iterations). At least on Zen 3 I have seen clustered variations for certain results (e.g., a given parameter set is sometimes measured at 3.00 cycles/iteration and sometimes at 3.15-3.17 cycles/iteration), but in the big scheme of things these variations are minor.

Here are some numbers for a few recent microarchitectures:

   Zen 4
  -O   -O3
 2.70  1.42 nl>ns wl>ws (maximally independent)
 3.09 21.00 nl>ns=_=>wl>ws
 3.09 21.00 nl>ns=^=>wl>ws
 3.00 29.00 nl>ns=_=>wl>ws=>nl (recurrence)
 3.00  8.80 wl>ws=>wl (recurrence), nl>ns (no recurrence)
 3.09  2.00 nl>ns=>nl (recurrence), wl>ws (no recurrence)
 3.07 21.00 nl>ns=>nl,wl>ws
 3.06 29.00 ws=_=>nl>ns=_=>wl
 3.09  2.00 wl>ws=_=>nl>ns
 3.09 22.33 wl>ws_=^>wl (recurrence), nl>ns (no recurrence)
 3.00 29.00 ws=_=>nl>ns=^=>wl>ws (recurrence), also ws_=^>wl