This kind of glass chin may not show up in the usual benchmarking (because it may not occur in the few hot loops of the benchmarks), but it can slow down other code significantly (as demonstrated by the bubble-sort case).
Apparently the GCC maintainers noticed this themselves and reduced the aggressiveness of auto-vectorization between gcc-12 and gcc-14, but some remains. I wanted to see whether there are still cases where auto-vectorization leads to slowdowns, so I wrote a microbenchmark whose central loop is:
for (i=0; i<n; i+=4) {
  long wl0, wl1;
  ns[i] = nl[i]+1;
  wl0 = wl[i];
  wl1 = wl[i+1];
  ws[i] = wl0;
  ws[i+1] = wl1;
}
Here "n" stands for narrow (one long), "w" for wide (two longs), "l"
for load, and "s" for store (e.g., "ns" stands for "narrow store").
gcc-14 compiles the body of the loop as follows:
gcc-14 -O:

loophead1:
mov    (%rcx,%rax,8),%rdi
add    $0x1,%rdi
mov    %rdi,(%r10,%rax,8)
mov    (%rsi,%rax,8),%r9
mov    0x8(%rsi,%rax,8),%rdi
mov    %r9,(%rdx,%rax,8)
mov    %rdi,0x8(%rdx,%rax,8)
add    $0x4,%rax
cmp    %r8,%rax
jb     loophead1

gcc-14 -O3:

loophead2:
mov    (%rcx,%rax,8),%r10
lea    0x1(%r10),%r9
mov    %r9,(%rdi,%rax,8)
movdqu (%rsi,%rax,8),%xmm0
movups %xmm0,(%rdx,%rax,8)
add    $0x4,%rax
cmp    %r8,%rax
jb     loophead2

In the
-O3 version the two longs wl[i]
and wl[i+1] are loaded with one
movdqu instruction, and then are stored
to ws[i] and ws[i+1] with one
movups instruction, whereas the -O code
loads and stores each long separately, resulting in a total of
four mov instructions for these accesses. The remainder
of the code does not differ much between the optimization levels.
The surrounding code sets up nl, ns, wl, and ws such that various memory dependencies are possible between the loads and the stores. Advancing i by 4 in each iteration also makes it possible to set up cases where no memory dependencies occur.
The loop shown above is exercised by the
binaries stwlf-*, which take four parameters: offsets (in
longs) of ns, wl, ws, nl from a common base address (the ordering of the
parameters puts nl last for historical reasons). E.g.,

./stwlf-x86_64-gcc-14-O3 2 2 4 0

means that ns and wl point to the same place, ws points 2 longs later, and nl points 2 longs earlier (4 longs before ws). Since i advances by 4, this means that
ws[i] stores to the same place in
one iteration that nl[i] loads from in the next
iteration. In this way there is a dependence chain of stores and
loads across iterations (two loads and two stores per iteration).
You can benchmark a whole group of parameter sets
on stwlf-x86_64-gcc-14-O
and stwlf-x86_64-gcc-14-O3 with:
make

or benchmark the same group of parameter sets on a specific version with

make bench CC=gcc-14 OPT=-O3

which will also build the needed binary if it does not exist yet. So for downloading and running the existing binaries on a specific machine you do:

wget -O - https://www.complang.tuwien.ac.at/anton/stwlf/stwlf.tar.xz | tar xfJ -
cd stwlf
make

The output contains numbers which are the number of cycles of one iteration of the loop. It also contains a description of the data flow of each iteration. The dependencies
nl>ns
and wl>ws through registers are built into the binary
and cannot be changed by parameter settings, but there can be memory
dependencies between any store and any load due to the way the
parameters are set. These dependencies are shown as:
=>   full width and fully overlapping
=_=> mixed narrow/wide access where the low-address long of the wide access is accessed by the narrow access
=^=> mixed narrow/wide access where the high-address long of the wide access is accessed by the narrow access
_=^> partially overlapping wide-store-to-wide-load dependency, with the low long of the store becoming the high long of the load (i.e., the store address is one long higher than the load address)

A "recurrence" means that there is a dependence chain within one iteration that continues with a loop-carried dependency to the start of the chain in the next iteration, resulting in a dependence chain throughout all iterations of the loop.
The dependencies are my guess at what causes these numbers. There may also be other microarchitectural effects at work that depend on the actual parameter set and the hardware, and you may want to look at the actual parameter sets (in the Makefile) for deeper insight.
The numbers come from one run (100_000 repetitions of the loop with 1000 iterations), and at least on Zen 3 I have seen clustered variations for certain results (e.g., sometimes a given parameter set is usually measured as costing 3.16-3.18 cycles/iteration, and sometimes 3.00 cycles/iteration), but in the big scheme of things these variations are minor.
Here are some numbers for a few recent microarchitectures:
AMD || Intel P-cores || Intel E-Cores |
Zen 4 | Zen 3 | Zen 2 | Zen ||Golden Cove|Rocket Lake| Skylake ||Gracemont | Tremont |
-O -O3 | -O -O3 | -O -O3 | -O -O3 || -O -O3 | -O -O3 | -O -O3 || -O -O3 | -O -O3 | dependencies
2.14 1.41|2.09 1.52| 3.02 2.01| 3.02 2.02|| 1.65 1.27| 2.43 1.90|3.10 2.03||2.04 1.63| 3.09 2.04|nl>ns wl>ws (maximally independent)
3.00 21.00|3.17 20.01| 3.02 18.01| 3.05 16.01|| 2.01 1.57| 2.84 20.00|3.07 16.01||2.04 6.18| 3.58 14.51|nl>ns=_=>wl>ws
3.09 21.00|3.17 20.01| 3.01 18.01| 3.03 16.01|| 1.94 19.01| 2.81 20.00|3.10 16.01||2.03 15.51| 3.56 14.51|nl>ns=^=>wl>ws
3.00 29.00|3.18 28.00|15.03 22.49|14.84 22.59||10.84 22.49|12.50 25.00|9.73 21.00||2.04 19.00|10.97 17.51|nl>ns=_=>wl>ws=>nl (recurrence)
3.09 8.80|3.18 8.87| 6.95 7.86| 7.02 7.95|| 5.08 7.16| 5.12 5.04|4.39 4.93||2.04 6.90| 5.00 5.97|wl>ws=>wl (recurrence), nl>ns (no recurrence)
3.08 2.00|3.16 2.10| 8.06 7.86| 8.14 7.88|| 7.27 7.22| 7.14 7.23|4.46 5.60||2.44 1.63| 6.00 5.98|nl>ns=>nl (recurrence), wl>ws (no recurrence)
3.07 21.00|3.16 20.01| 8.75 18.12| 8.81 16.01|| 6.98 6.50| 7.18 20.00|5.73 16.02||7.91 10.08| 7.00 14.51|nl>ns=>nl,wl>ws
3.07 29.00|3.18 28.00|15.03 22.49|14.84 22.58||10.84 22.51|12.50 25.00|9.73 21.00||2.04 19.00|10.98 17.51|ws=_=>nl>ns=_=>wl
3.09 2.00|3.17 2.11| 3.03 2.19| 3.13 2.25|| 2.01 1.26| 2.85 1.99|3.08 2.03||2.04 1.72| 3.59 2.12|wl>ws=_=>nl>ns
3.09 22.33|3.18 21.12| 3.26 24.09| 3.31 22.11|| 2.03 19.63| 3.39 20.64|3.15 16.64||2.17 16.02| 4.06 15.06|wl>ws_=^>wl (recurrence) nl>ns (no recurrence)
3.00 29.00|3.17 28.00|10.89 24.00|10.92 22.57|| 8.01 25.97|11.85 25.00|7.72 21.00||2.97 19.00| 8.01 17.50|ws=_=>nl>ns=^=>wl>ws (recurrence), also ws_=^>wl
If there are no memory dependencies, the -O3 binary (i.e., with the
vectorized wide accesses for the two adjacent longs) outperforms the -O
binary (two scalar accesses) on all of these microarchitectures; but
note that the benchmark is more store-heavy than typical code (30% of
the -O instructions are stores), so the 1.5 times higher number of
stores in the -O code may have more influence on performance than
in most other code, just as the slowdowns discussed below are more
prominent in this code than in most other code.
As soon as the wide load depends on any memory written by any
close-by instruction (even a fully overlapping wide
store), -O outperforms -O3 in most cases, on
all microarchitectures; when the memory dependence is from a narrow
store or a partially overlapping wide store to the wide load, the
slowdown is huge. A dependence from a wide store to a narrow load
seems to incur no additional cost and is almost as cheap as no
dependence.
The narrow-to-wide slowdown occurs even in those cases where the dependency chain does not include a recurrence. Two theories might explain this: 1) The load/store unit is blocked for 10-20 cycles (depending on the microarchitecture) by whatever happens in the slow path. 2) The situation results in a microcode assist.
The only other case where -O3
outperforms -O is when the only memory dependency is a
wide-store to narrow-load dependency.
One interesting aspect is that, while the cores tend to cope with the memory dependencies better in more recent microarchitectures (the leftmost of each family is the most recent), the penalties for memory dependencies involving wide loads (in -O3) have a tendency to increase in more recent generations.
In this benchmark, gcc-14.2 -O3 produces slowdowns
over gcc-14.2 -O3 -fno-tree-vectorize, by a factor of 5.6
over clang-19.1 -O3 (where clang does not vectorize
anything in this benchmark), by a factor of 5.3 over gcc-14.2
-O, and by a factor of 2.1 over gcc-14.2 -O0.
It is unclear in how many other cases the slow path becomes a performance problem, but reliably fast code looks more desirable to me than code that is often slightly faster, but sometimes much slower.
So I think that compilers should avoid combining adjacent loads into wide loads unless they have good evidence that there are no recent stores to the same location.
Until then, you can disable auto-vectorization globally
with -fno-tree-vectorize, but this also disables
vectorization in cases where such slowdowns cannot happen (e.g., in a
loop that only reads from memory).
A way to avoid slowdowns in a specific case would be to put an
optimization hurdle (such as volatile or an
empty asm statement that declares that it modifies
memory) in or between the loads that should not be combined into a
wide load. But how do you learn that you have such a performance
problem? Maybe with perf record followed by perf
annotate; for bubble-sort compiled with gcc-14.2
-O3 it attributes 65% of the cycles to the wide load.