A sequence of dependent adds like

  addq $1, %rax
  addq $1, %rax
  addq $1, %rax
  addq $1, %rax
  addq $1, %rax

usually has 5 cycles of total latency. The P-Core of the Core i3-1315U (a Raptor Cove with 1.25MB L2 cache; or does this mean it is a Golden Cove?), however, executes this sequence much faster. This microbenchmark explores the performance characteristics behind that. There are 20 loops (a-t), most containing sequences like the one shown above, but there are others:
a                   i                   j                   n                   k-m
addq %r8, %rax      leaq 1(%rax),%r8    addq $1, %rax       imul %rsi, %rax     imul %rsi, %rax
addq %rcx, %rax     leaq 1(%r8),%rcx    addq %rax,%r8       addq %rsi, %rax     addq $1, %rax
addq %rdx, %rax     leaq 1(%rcx),%rdx   addq $1, %rax                           addq $1, %rax
addq %rsi, %rax     leaq 1(%rdx),%rsi   addq %rax,%rcx                          ...
addq %rdi, %rax     leaq 1(%rsi),%rax   addq $1, %rax
cmpq %rax, %r9      cmpq %rax, %rdi     addq %rax,%rdx
jg 2b               jg 2b               addq $1, %rax
                                        addq %rax,%rsi
                                        addq $1, %rax
                                        addq %rax,%r9
                                        addq $1, %rax
                                        cmpq %rax, %rdi
                                        jg 2b
Some loops check what happens if we use leaq instead of addq, and if we use more than one register.
Loops k-m measure the latency of addq $1, %rax sequences by masking the ALU resource contention with the latency of an imul instruction. I.e., these resources can be consumed for free during the 3-cycle latency of the imul.
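To make the measurement idea concrete, here is a minimal stand-alone sketch (not the benchmark's own main.c/loop1.s): an imul followed by a chain of ten addq $1 instructions, roughly the shape of loop l, timed with __rdtsc(). The loop control (subq/jnz instead of cmpq/jg), the iteration count, and the use of the TSC are my choices; the actual benchmark uses perf's cycle counter, and TSC ticks only approximate core cycles when the core does not run at the TSC frequency.

/* chain.c -- compile with: gcc -O2 chain.c
   Rough sketch of timing an imul + dependent-add chain. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    uint64_t iters = 100000000;  /* 100M iterations of the measured loop */
    uint64_t a = 1, b = 1;
    uint64_t start = __rdtsc();
    /* Dependency chain per iteration: imul + 10 dependent addq $1 on a. */
    asm volatile(
        "1:\n\t"
        "imul %[b], %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "addq $1, %[a]\n\t"
        "subq $1, %[n]\n\t"
        "jnz 1b"
        : [a] "+r" (a), [n] "+r" (iters)
        : [b] "r" (b)
        : "cc");
    uint64_t ticks = __rdtsc() - start;
    printf("result %llu, %.2f TSC ticks/iteration\n",
           (unsigned long long)a, (double)ticks / 100000000.0);
    return 0;
}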
The results we see are:
loop cyc/it Mops dep.adds remarks
a      5.00    6     5    addq reg, %rax
b      1.01    6     5    addq $1, %rax
f      1.23    7     6    addq $1, %rax
c      1.98   11    10    addq $1, %rax
d      2.00   12    11    addq $1, %rax
e      2.21   13    12    addq $1, %rax
g      3.01   18    17    addq $1, %rax
h      3.25   19    18    addq $1, %rax
i      1.00    6     5    leaq 1(reg1),reg2
j      2.00   12     6    addq $1, %rax; addq %rax, reg
n      4.00    3     1    imul %rsi, %rax; addq %rsi, %rax
k      3.00    3     1    imul %rsi, %rax; addq $1, %rax
l      3.01   12    10    imul %rsi, %rax; addq $1, %rax
m      3.20   17    15    imul %rsi, %rax; addq $1, %rax
o      3.00   12    10    imul ...; addq $0x333, %rax; ...; addq $-0x333+2, %rax
p      7.39   12    10    imul ...; addq $0x334, %rax; ...; addq $-0x334+2, %rax
q      3.00   12    10    imul ...; addq $-0x333, %rax; ...; addq $0x333+2, %rax
r      7.72   12    10    imul ...; addq $-0x334, %rax; ...; addq $0x334+2, %rax
s      3.00   12    10    imul; +1023+1023+1023+1023-1024-1024-1024-1024-1009+1023
t      3.74   12    10    imul; +1023+1023+1023+1023-1024-1024-1024-1024-1024+1038

The Mops column counts the addq, cmpq+jg, leaq, and imul as Macro-ops.
Loop n (imul followed by a register add) takes 4 cycles per iteration. If we replace addq %rsi, %rax with addq $1, %rax, the latency goes down to 3 cycles (k). Additional adds don't increase the cycles for a loop iteration up to and including a sequence of 10 addq $1, %rax (l). After that it rises slowly (m).
I believe that the hardware optimizes sequences of constant adds at the renaming stage of the microarchitecture. Not later, because according to Chips and Cheese, Golden Cove (and, by extension, Raptor Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier, because it works across the zero-cycle store-to-load forwarding (which probably happens no earlier than the renaming stage). The renamer already performs move elimination; the constant-add elimination could be a generalized form of that.
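To illustrate what such a generalization could look like, here is a toy model (entirely my own sketch, not a description of Intel's actual renamer): each architectural register maps to a physical register plus a pending constant offset; a constant add only updates the offset, and the offset is materialized only when an instruction needs the real value.

#include <stdio.h>
#include <stdint.h>

/* Toy model of constant-add accumulation at rename time (my sketch,
   not Intel's design): an architectural register maps to a physical
   register plus a pending constant offset. */
typedef struct {
    int     phys;    /* stand-in for a physical register id */
    int64_t offset;  /* accumulated, not yet materialized constant */
} rename_entry;

static int alu_ops;  /* operations that would actually occupy an ALU */

/* addq $imm, reg: handled entirely in the renamer in this model */
static void add_const(rename_entry *r, int64_t imm)
{
    r->offset += imm;
}

/* Fold the pending offset into the physical register; in the model
   this is the point where a real add has to be executed. */
static void materialize(rename_entry *r)
{
    if (r->offset != 0) {
        alu_ops++;
        r->offset = 0;
    }
}

/* imul reg2, reg1: needs the real value, so materialize first */
static void multiply(rename_entry *dst, const rename_entry *src)
{
    materialize(dst);
    alu_ops++;
    (void)src;
}

int main(void)
{
    rename_entry rax = {0, 0}, rsi = {1, 0};
    /* One iteration of a loop-l-like body: imul %rsi,%rax followed by
       10 addq $1,%rax, then a materialization for the final cmpq. */
    multiply(&rax, &rsi);
    for (int i = 0; i < 10; i++)
        add_const(&rax, 1);
    materialize(&rax);
    printf("ALU operations per iteration in this model: %d\n", alu_ops);
    return 0;
}

In this model the 10 constant adds collapse into a single real add, which is consistent with the throughput numbers, but it says nothing about whether the materialization costs latency.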
I don't have a good explanation for the apparent latency of 0 for the adds following the imul in loops k and l. I would think that the additions accumulated in some state in the renamer have to be reified in a physical register when that register is used for a proper ALU operation, and I would expect that to cost one cycle of latency. One theory I considered was that the microarchitecture internally has a multiply-add unit, but then one would expect that to work for loop n as well, yet there we see an additional cycle of latency from the addition.
Loop o looks like this:

2:	imul %rsi, %rax
	addq $0x333, %rax
	addq $0x333, %rax
	addq $0x333, %rax
	addq $0x333, %rax
	addq $0x333, %rax
	addq $-0x333+2, %rax
	addq $-0x333+2, %rax
	addq $-0x333+2, %rax
	addq $-0x333+2, %rax
	addq $-0x333+2, %rax
	cmpq %rax, %rdi
	jg 2b

Here the largest intermediate sum is 4095 (0xfff = 5*0x333); if we replace 0x333 with 0x334 (loop p, largest intermediate sum 4100 (0x1004 = 5*0x334)), the loop runs much slower: 7.39 cycles/iteration on average, but with quite a bit of variation (although perf stat -r 100 reports only 0.1% variation).
Similarly, first adding -0x333 five times and then adding 0x333+2 five times is fast (loop q), while adding -0x334 and 0x334+2, respectively, is slow (loop r). So apparently the limits of this optimization are that the intermediate sum must stay in the 13-bit range -4096..4095. The individual addends apparently have to be in the range -1024..1023 (loops s and t).
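Stated compactly, my reading of these measurements (not a documented rule) is: each addend must fit in -1024..1023, and every intermediate sum of the accumulated constants must stay within -4096..4095. A small checker that encodes this interpretation:

#include <stdbool.h>
#include <stdio.h>

/* Checks whether a sequence of constant addends stays within the
   limits inferred above: each addend in -1024..1023 and every
   intermediate sum in -4096..4095.  This is my interpretation of the
   measurements, not a documented hardware rule. */
static bool eliminable(const long *addend, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (addend[i] < -1024 || addend[i] > 1023)
            return false;      /* addend out of range (loops s/t) */
        sum += addend[i];
        if (sum < -4096 || sum > 4095)
            return false;      /* intermediate sum out of range (loops p/r) */
    }
    return true;
}

int main(void)
{
    long o[10] = {0x333, 0x333, 0x333, 0x333, 0x333,
                  -0x333+2, -0x333+2, -0x333+2, -0x333+2, -0x333+2};
    long p[10] = {0x334, 0x334, 0x334, 0x334, 0x334,
                  -0x334+2, -0x334+2, -0x334+2, -0x334+2, -0x334+2};
    printf("loop o eliminable: %d\n", eliminable(o, 10));  /* 1 (fast) */
    printf("loop p eliminable: %d\n", eliminable(p, 10));  /* 0 (slow) */
    return 0;
}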
You can reproduce the measurements with make, or possibly something like taskset -c 2 make, on a Linux system with perf. The calls are scaled so that each loop performs 1G iterations (which you can see in the number of branches shown in the perf output), so dividing the number of cycles by 1G gives the cycles/iteration.
Files: Makefile, loop1.s, loop1.o, main.c, main.o, main