This kind of glass chin may not show up in the usual benchmarking (because it may not occur in the few hot loops of the benchmarks), but it may slow down other code significantly (as demonstrated by the bubble-sort case).
Apparently the GCC maintainers noticed this themselves and reduced the aggressiveness of auto-vectorization between gcc-12 and gcc-14, but some remains. So I wanted to see whether there are still cases where the auto-vectorization leads to slowdowns, and I wrote a program whose central loop is:
for (i=0; i<n; i+=4) {
  long wl0, wl1;
  ns[i] = nl[i]+1;
  wl0 = wl[i];
  wl1 = wl[i+1];
  ws[i] = wl0;
  ws[i+1] = wl1;
}
Here "n" stands for narrow, "w" for wide, "l" for load, and "s" for
store (e.g., "ns" stands for "narrow store"). gcc-14 compiles the body
of the loop as follows:
gcc-14 -O:

loophead1:
        mov    (%rcx,%rax,8),%rdi
        add    $0x1,%rdi
        mov    %rdi,(%r10,%rax,8)
        mov    (%rsi,%rax,8),%r9
        mov    0x8(%rsi,%rax,8),%rdi
        mov    %r9,(%rdx,%rax,8)
        mov    %rdi,0x8(%rdx,%rax,8)
        add    $0x4,%rax
        cmp    %r8,%rax
        jb     loophead1

gcc-14 -O3:

loophead2:
        mov    (%rcx,%rax,8),%r10
        lea    0x1(%r10),%r9
        mov    %r9,(%rdi,%rax,8)
        movdqu (%rsi,%rax,8),%xmm0
        movups %xmm0,(%rdx,%rax,8)
        add    $0x4,%rax
        cmp    %r8,%rax
        jb     loophead2

In the
-O3 version the two longs wl[i]
and wl[i+1] are loaded with one
movdqu instruction, and then are stored
to ws[i] and ws[i+1] with one
movups instruction, whereas the -O code
loads and stores each long separately, resulting in a total of
four mov instructions for these accesses. The remainder
of the code is similar.
The surrounding code sets up nl, ns, wl, and ws such that various memory dependencies are possible between the loads and the stores. Advancing i by 4 in each iteration also allows setting up cases where no memory dependencies happen.
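As a minimal sketch of what such a setup might look like (the names setup_pointers, base, and off_* are mine, not taken from the actual stwlf source): all four arrays live in one buffer, and the command-line offsets place each pointer relative to a common base, so overlapping offsets create memory dependencies between the loop's stores and loads.

```c
/* Hypothetical setup sketch: derive the four array pointers from a
   common base and per-array offsets given in longs.  Choosing offsets
   so that ranges overlap creates memory dependencies between the
   stores and loads of the benchmark loop. */
void setup_pointers(long *base, long off_ns, long off_wl, long off_ws,
                    long off_nl, long **ns, long **wl, long **ws, long **nl)
{
    *ns = base + off_ns;
    *wl = base + off_wl;
    *ws = base + off_ws;
    *nl = base + off_nl;
}
```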
The loop shown above is exercised by the
binaries stwlf-*, which take four parameters: the offsets (in
longs) of ns, wl, ws, and nl from a common base address (the ordering of the
parameters puts nl last for historical reasons). E.g.,

./stwlf-x86_64-gcc-14-O3 2 2 4 0

means that ns and wl point to the same place, ws points 2 longs later, and nl points 2 longs earlier (4 longs before ws). Since i advances by 4, this means that
ws[i] stores to the same place in
one iteration that nl[i] loads from in the next
iteration. In this way there is a dependence chain of stores and
loads across iterations (two loads and two stores per iteration).
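This cross-iteration aliasing follows from simple index arithmetic (a sketch with names of my own choosing, not code from stwlf): ws[i] sits at index off_ws + i from the base, and the next iteration's nl[i+4] at off_nl + i + 4, so the store feeds that load exactly when off_ws == off_nl + 4.

```c
/* Does the store ws[i] write the long that nl[i+step] reads in the
   next iteration?  off_ws and off_nl are the offsets (in longs) of
   ws and nl from the common base; step is the loop increment (4 in
   the loop above).  ws[i] is at index off_ws + i, and nl[i+step] at
   off_nl + i + step, so the i's cancel. */
int store_feeds_next_load(long off_ws, long off_nl, long step)
{
    return off_ws == off_nl + step;
}
```

With the parameter set 2 2 4 0 above, off_ws = 4 and off_nl = 0, so the condition holds.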
You can benchmark a whole group of parameter sets
on stwlf-x86_64-gcc-14-O
and stwlf-x86_64-gcc-14-O3 with:
make

or benchmark the same group of parameter sets on a specific version with

make bench CC=gcc-14 OPT=-O3

which will also build the needed binary if it does not exist yet. So for downloading and running the existing binaries on a specific machine you do:
wget -O - https://www.complang.tuwien.ac.at/anton/stwlf/stwlf.zip | tar xfJ -
cd stwlf
make

The output contains numbers giving the number of cycles per iteration of the loop. It also contains a description of the data flow of each iteration. The dependencies
nl>ns
and wl>ws through registers are built into the binary
and cannot be changed by parameter settings, but there can be memory
dependencies between any store and any load due to the way the
parameters are set. These dependencies are shown as:
=>    memory dependency where the store and the load access the same address(es).
=_=>  mixed narrow/wide access where the low-address long of the wide access is accessed by the narrow access.
=^=>  mixed narrow/wide access where the high-address long of the wide access is accessed by the narrow access.
_=^>  partially overlapping wide-store-to-wide-load dependency, with the low long of the store becoming the high long of the load (i.e., the store address is one long higher than the load address).

A "recurrence" means that there is a dependence chain within one iteration that continues through a loop-carried dependency to the start of the chain in the next iteration, resulting in a dependence chain across all iterations of the loop.
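The arrows can be read off from simple index comparisons; here is a sketch of that classification (my own code, not from stwlf), for a narrow (1-long) access against a wide (2-long) access, and for the partial wide/wide overlap:

```c
/* Narrow access at long index n versus wide access covering long
   indices w and w+1: return the arrow used above, or "" if they do
   not overlap.  (Sketch of the notation, not code from stwlf.) */
const char *narrow_wide_arrow(long n, long w)
{
    if (n == w)     return "=_=>"; /* narrow hits the low long of the wide access */
    if (n == w + 1) return "=^=>"; /* narrow hits the high long of the wide access */
    return "";                     /* no overlap */
}

/* Wide store covering indices s and s+1 versus wide load covering l
   and l+1: "_=^>" is the partial overlap where the store's low long
   is the load's high long, i.e., the store address is one long
   higher than the load address. */
int is_partial_wide_overlap(long s, long l)
{
    return s == l + 1;
}
```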
The dependencies are my guess at what causes these numbers. There may also be other microarchitectural effects at work that depend on the actual parameter set and the hardware, and you may want to look at the actual parameter sets for deeper insight.
The numbers come from one run (100_000 repetitions of the loop with 1000 iterations), and at least on Zen 3 I have seen clustered variations for certain results (e.g., sometimes a given parameter set is measured as costing 3.00 cycles/iteration, and sometimes 3.15-3.17 cycles/iteration), but in the big scheme of things these variations are minor.
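Given those counts, the reported figure is presumably just the total cycle count divided by the number of executed iterations; a trivial sketch (the actual timing code in stwlf may differ):

```c
/* Convert the total cycle count of one run into cycles per loop
   iteration, given the repetition count (100_000 above) and the
   iteration count per repetition (1000 above). */
double cycles_per_iteration(double total_cycles, long reps, long iters)
{
    return total_cycles / ((double)reps * (double)iters);
}
```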
Here are some numbers for a few recent microarchitectures:
Zen 4:

   -O    -O3
 2.70   1.42  nl>ns wl>ws (maximally independent)
 3.09  21.00  nl>ns=_=>wl>ws
 3.09  21.00  nl>ns=^=>wl>ws
 3.00  29.00  nl>ns=_=>wl>ws=>nl (recurrence)
 3.00   8.80  wl>ws=>wl (recurrence), nl>ns (no recurrence)
 3.09   2.00  nl>ns=>nl (recurrence), wl>ws (no recurrence)
 3.07  21.00  nl>ns=>nl, wl>ws
 3.06  29.00  ws=_=>nl>ns=_=>wl
 3.09   2.00  wl>ws=_=>nl>ns
 3.09  22.33  wl>ws_=^>wl (recurrence), nl>ns (no recurrence)
 3.00  29.00  ws=_=>nl>ns=^=>wl>ws (recurrence), also ws_=^>wl