Zero-cycle constant adds

Most CPU cores have a latency of one cycle for an integer addition, so a sequence like

  	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax

usually has 5 cycles of total latency. The P-Core of the Core i3-1315U (a Raptor Cove with 1.25MB L2 cache (or does this mean it is a Golden Cove?)), however, executes this sequence much faster. This microbenchmark finds out performance characteristics about that. There are 14 loops, most containing sequences like the one shown above, but there are others:

      a                 i                  j               n              k-m
addq %r8, %rax   leaq 1(%rax),%r8   addq $1, %rax   imul %rsi, %rax  imul %rsi, %rax
addq %rcx, %rax  leaq 1(%r8),%rcx   addq %rax,%r8   addq %rsi, %rax  addq $1, %rax
addq %rdx, %rax  leaq 1(%rcx),%rdx  addq $1, %rax                    addq $1, %rax
addq %rsi, %rax  leaq 1(%rdx),%rsi  addq %rax,%rcx                   ...
addq %rdi, %rax  leaq 1(%rsi),%rax  addq $1, %rax
cmpq %rax, %r9   cmpq %rax, %rdi    addq %rax,%rdx
jg   2b          jg   2b            addq $1, %rax
                                    addq %rax,%rsi
                                    addq $1, %rax
                                    addq %rax,%r9
                                    addq $1, %rax
                                    cmpq %rax, %rdi
                                    jg   2b

Loop a tests if the speedup also holds if both operands to the addition are registers.
Loop i tests if it also holds if we use leaq instead of addq, and if we use more than one register.
Loop j tests if it also holds if every intermediate result is actually being used.
Loop n measures the latency of a multiply with a register (that contains 1), and a register-register add, in preparation for:
Loops k-m measure the pure latency of the add $1, %eax sequences by masking the ALU resource contention with the latency of an imul instruction. I.e., these resources can be consumed for free during the 3-cycle latency of imul.

The results we see are:

loop cyc/it Mops dep.adds remarks
 a    5.00    6     5     addq reg, %eax
 b    1.01    6     5     addq $1, %rax
 f    1.23    7     6     addq $1, %rax
 c    1.98   11    10     addq $1, %rax
 d    2.00   12    11     addq $1, %rax
 e    2.21   13    12     addq $1, %rax
 g    3.01   18    17     addq $1, %rax
 h    3.25   19    18     addq $1, %rax
 i    1.00    6     5     leaq 1(reg1),reg2
 j    2.00   12     6     addq $1, %rax; addq %rax, reg
 n    4.00    3     1     imul %rsi, %rax; addq %rsi, %rax
 k    3.00    3     1     imul %rsi, %rax; addq $1, %rax
 l    3.01   12    10     imul %rsi, %rax; addq $1, %rax
 m    3.20   17    15     imul %rsi, %rax; addq $1, %rax
 o    3.00   12    10     imul ...; addq $0x333, %rax; ...; addq $-0x333+2, %rax
 p    7.39   12    10     imul ...; addq $0x334, %rax; ...; addq $-0x334+2, %rax
 q    3.00   12    10     imul ...; addq $-0x333, %rax; ...; addq $0x333+2, %rax
 r    7.72   12    10     imul ...; addq $-0x334, %rax; ...; addq $0x334+2, %rax
 s    3.00   12    10     imul; +1023+1023+1023+1023-1024-1024-1024-1024-1009+1023
 s    3.74   12    10     imul; +1023+1023+1023+1023-1024-1024-1024-1024-1024+1038

The Mops column counts the addq, cmpq+jg, leaq, and imul as Macro-ops.

We see from loop a that a general add has a latency of one cycle.
But adds with a constant that depend on each other run at 6 macro-ops per cycle (where cmpq+jg is a macro-op), at least if the loop matches the constraints perfectly (loops b,d,g).
For slight mismatches (loops f,c,e,h), there is a penalty beyond just needing a resource for a cycle.
Using leaq (loop i) works just as fast as using addq, and using several registers does not slow execution down, either.
If every intermediate result is used (loop j), the dependent adds are not slowed down to 1 per cycle, but of course the additional adds consume resources and that slows the whole loop down to 2 cycles/iteration.
We see a latency of 4 cycles (3 for the imul, 1 for the addq) in loop n.
Interestingly, if we replace the addq %rsi, %rax with addq $1, %rax, the latency goes down to 3 cycles (k). Additional adds don't increase the cycles for a loop iteration up to and including a sequence of 10 addq $1, %eax (l). After that it rises slowly (m).

What is happening here? I can only guess:

I believe that the hardware optimizes sequences of constant adds at the renaming stage of the microarchitecture. Not later, because according to Chips and Cheese Golden Cove (and, by extension, Raptor Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier, because it works across the zero-cycle store-to-load-forwarding (which probably happens not earlier than the renaming stage). The renamer already performs move elimination, the constant-add elimination could be a generalized form of that.

I don't have a good explanation for the apparent latency of 0 for the adds following the imul in loops k and l. I would think that the additions accumulated in some state in the renamer have to be reified in a physical register when it is used for a proper ALU operation, and I would expect that to cost one cycle of latency. One theory I considered was that the microarchitecture internally has a multiply-add unit, but one would expect that to work also for loop n, but there we see an additional cycle of latency from the addition.

Limit

I played around with variants of loop l to find out where the number limits of this optimization are: the last loops that had the same performance as l was o:

2: imul %rsi, %rax addq $0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	cmpq	%rax, %rdi
	jg	2b

Here the largest intermediate sum is 4095 (0xfff); if we replace 0x333 with 0x334 (loop p, intermediate sum 4100 (0x1004)), the loop runs much slower (7.39 cycles/iteration on average, but with quite a bit of variation (although perf stat -r 100 reports only 0.1% variation). Similarly first adding -0x333 5 times, then adding 0x333+2 5 times is fast, while adding -0x334 and 0x334+2 respectively is slow. So apparently the limits of this optimization are that the intermediate sum must stay in the 13-bit range -4096..4095.

The individual addends apparentl have to be in the range -1024..1023 (loops s and t).

How to run this yourself

Download this directory and do make or possibly something like taskset -c 2 make on a Linux system with perf. The calls are scaled for 1G iterations (which you can see in the number of branches shown in the perf output), so dividing the number of cycles by 1G gives the cycles/iteration.

Previous work

After my measurements and writing the above, I found that Chipsandcheese have published about this two years earlier, interestingly in the article on Gracemont (in very short form), not in the one on Golden Cove.

Anton Ertl

Name	Last modified	Size

Parent Directory		-
main.o	23-Dec-2023 19:19	4.1K
main.c	23-Dec-2023 19:19	2.1K
main	23-Dec-2023 19:31	16K
loop1.s	23-Dec-2023 19:31	7.7K
loop1.o	23-Dec-2023 19:31	3.4K
Makefile	23-Dec-2023 10:47	502

Apache/2.2.22 (Debian) DAV/2 mod_fcgid/2.3.6 PHP/5.4.36-0+deb7u3 mod_python/3.3.1 Python/2.7.3 mod_ssl/2.2.22 OpenSSL/1.0.1e mod_perl/2.0.7 Perl/v5.14.2 Server at www.complang.tuwien.ac.at Port 80