Index of /anton/pred-store-forward

Name            Last modified      Size

Parent Directory                   -
Makefile        26-Apr-2021 10:42  1.6K
cmove           26-Apr-2021 09:24  17K
cmove-gforth    26-Apr-2021 09:24  17K
cmove-gforth.c  26-Apr-2021 09:15  113
cmove-gforth.o  26-Apr-2021 09:23  1.2K
cmove.c         18-Apr-2021 18:26  131
cmove.o         24-Apr-2021 18:59  1.2K
lmove           26-Apr-2021 09:24  17K
lmove.c         18-Apr-2021 18:28  149
lmove.o         24-Apr-2021 18:59  1.2K
main.c          18-Apr-2021 13:46  400
main.o          24-Apr-2021 18:59  1.8K
main1.c         18-Apr-2021 13:40  397
main1.o         24-Apr-2021 18:59  1.8K
main2.c         26-Apr-2021 09:24  589
main2.o         26-Apr-2021 09:24  2.3K
move            24-Apr-2021 18:59  16K
move.c          18-Apr-2021 12:34  117
move.o          24-Apr-2021 18:59  1.2K
move1           24-Apr-2021 18:59  16K
move1.c         18-Apr-2021 12:41  124
move1.o         24-Apr-2021 18:59  1.2K
move1a          24-Apr-2021 18:59  16K
move2           24-Apr-2021 18:59  16K
move2.c         18-Apr-2021 13:16  138
move2.o         24-Apr-2021 18:59  1.2K
move3           24-Apr-2021 18:59  16K
move3.c         18-Apr-2021 13:17  140
move3.o         24-Apr-2021 18:59  1.2K
smove           26-Apr-2021 09:24  17K
smove.c         18-Apr-2021 18:27  153
smove.o         24-Apr-2021 18:59  1.2K

Article: 80706 of comp.arch
Path: eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Predictive Store Forwarding
Date: Sun, 18 Apr 2021 10:03:31 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 175
Message-ID: <2021Apr18.120331@mips.complang.tuwien.ac.at>
References: <2021Apr10.132712@mips.complang.tuwien.ac.at> <%lmcI.37907$z%6.36586@fx46.iad> <c0322d51-f938-4615-9167-9f89f21ffd19n@googlegroups.com> <gCfeI.2004$4H2.1589@fx21.iad> <D4ieI.27704$N_4.8675@fx36.iad> <2021Apr16.192352@mips.complang.tuwien.ac.at> <FmkeI.14410$lyv9.13421@fx35.iad> <XOneI.3805$8O4.2396@fx16.iad>
Injection-Info: reader02.eternal-september.org; posting-host="bcae12e90f8c9b13399017d3653c1999";
	logging-data="6469"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+75FQ1gBDzSM1Nvh8cXjMD"
Cancel-Lock: sha1:kEWdaFk5FrFUDx6NWxX3ubC10bs=
X-newsreader: xrn 10.00-beta-3
Xref: reader02.eternal-september.org comp.arch:80706

EricP <ThatWouldBeTelling@thevillage.com> writes:
>EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>> That appears to be bingo! except I missed the optimization where
>>>> the load dest register may be renamed to the store's source
>>>> register, if possible.
>>>
>>> The 6.26 cycles on Zen3 per load-store pair in a
>>> load-store-load-store... dependence chain indicates that this does not
>>> happen, at least not for this code:
>>>
>>>    0x000055e7154dfc50 <cmove+16>:       movzbl (%rdi,%rax,1),%edx
>>>    0x000055e7154dfc54 <cmove+20>:       mov    %dl,(%rsi,%rax,1)
>>>    0x000055e7154dfc57 <cmove+23>:       add    $0x1,%rax
>>>    0x000055e7154dfc5b <cmove+27>:       cmp    %rcx,%rax
>>>    0x000055e7154dfc5e <cmove+30>:       jne    0x55e7154dfc50 <cmove+16>
>>>
>>> which comes from this code:
>>>
>>>   while (u-- > 0)
>>>     *c_to++ = *c_from++;
>>>
>>> Would "mov (%rdi,%rax,1), %dl" instead of the first instruction work
>>> better?
>>>
>>> - anton
>> 
>> How about starting with full register width MOV's?

So I wrote a benchmark (move) with the following inner loop:

   d:   48 8b 14 c7             mov    (%rdi,%rax,8),%rdx
  11:   48 89 14 c6             mov    %rdx,(%rsi,%rax,8)
  15:   48 83 c0 01             add    $0x1,%rax
  19:   48 39 c1                cmp    %rax,%rcx
  1c:   75 ef                   jne    d <move+0xd>

which BTW comes out of

  for (i=0; i<count; i++)
    to[i] = from[i];

And just to remind casual readers: the to[i] of one iteration is the
from[i] of the next iteration.
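
For concreteness, here is a minimal sketch of how that overlap can be
arranged (a reconstruction for illustration; the actual harness is
main.c in the directory above and may differ in details):

  #include <stdio.h>

  #define N 100000

  /* the benchmarked inner loop, as in move.c */
  void move(long *from, long *to, long count)
  {
    long i;
    for (i=0; i<count; i++)
      to[i] = from[i];
  }

  int main(void)
  {
    static long buf[N+1];
    buf[0] = 1;
    /* to = from+1: the store to to[i] writes the cell that the
       next iteration loads as from[i+1] */
    move(buf, buf+1, N);
    printf("%ld\n", buf[N]);  /* keep the stores alive */
    return 0;
  }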

Here's what I measure (with speculative store bypass enabled on both
machines):

cycles/it
1.02      Zen3
6.82      Zen2

So it seems that Zen3 does indeed rename or copy the store's source
register into the load's target register, when it's a full register.

I also tried it with a char array instead of a long array, and
strangely, even though the resulting code is almost the same as the
cmove code above (which took 6.26 cycles per iteration on Zen3):

   d:   0f b6 14 07             movzbl (%rdi,%rax,1),%edx
  11:   88 14 06                mov    %dl,(%rsi,%rax,1)
  14:   48 83 c0 01             add    $0x1,%rax
  18:   48 39 c1                cmp    %rax,%rcx
  1b:   75 f0                   jne    d <move+0xd>

I now get 1.02 cycles per iteration, too.

With short and unsigned short, I get 1.63 cycles per iteration.

With int and unsigned int, I get 3.13 cycles per iteration.
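
All these variants differ only in the element type of the arrays;
e.g., the short version looks essentially like this (a sketch; the
actual code is in the directory above):

  void smove(short *from, short *to, long count)
  {
    long i;
    for (i=0; i<count; i++)
      to[i] = from[i];
  }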

>If PSF is doing what I suggest then it should demonstrate when
>the load effective address has some non-trivial calculation
>so that the load uOp arrives at the LSQ before the address is ready,
>and it pairs with a prior store still in the LSQ.
>Maybe something with a sum of some useless multiply-by-zero's
>that can't be optimized away and introduce some latency.
>
>   index = i*j + k*l + m;  // i=j=k=l=0
>   to[m] = from[index];
>   m++;

To test this, I made a variation (move1a) of the code above:

  for (i=0; i<count; i+=incr[i])
    to[i] = from[i];

   d:   48 8b 14 c7             mov    (%rdi,%rax,8),%rdx
  11:   48 89 14 c6             mov    %rdx,(%rsi,%rax,8)
  15:   49 03 04 c0             add    (%r8,%rax,8),%rax
  19:   48 39 c1                cmp    %rax,%rcx
  1c:   7f ef                   jg     d <move+0xd>

where incr[i] is the to[i] of the last iteration (always 1).  What I
get is:

cycles/it
 1.34     Zen3
14.03     Zen2
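
One aliasing that produces this (a sketch; the actual setup is in the
main*.c files in the directory above):

  /* inner loop of move1a */
  long move1a(long *from, long *to, long *incr, long count)
  {
    long i;
    for (i=0; i<count; i+=incr[i])
      to[i] = from[i];
    return i;
  }

  /* with to = from+1 as before, calling
       move1a(buf, buf+1, buf, N)
     makes incr[i] the same cell as from[i], i.e., the cell that
     the previous iteration stored through to[i-1] (always 1 when
     buf[0] = 1) */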

However, this is not good evidence of PSF at work, because it only
shows that the register is also forwarded for the incr load; if we
assume that SSB can do that, too, there is no long latency involved.

So I changed incr so that it does not overlap from and to (but still
always yields 1), and voila (move1):

cycles/it
6.02      Zen3
7.03      Zen2
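
Sketch of the changed setup (same caveat as above):

  #define N 100000
  static long buf[N+1], ones[N];  /* ones does not overlap buf */

  void setup_move1(void)          /* hypothetical helper name */
  {
    long i;
    for (i=0; i<N; i++)
      ones[i] = 1;                /* the increment is still always 1 */
    buf[0] = 1;
    /* then: move1a(buf, buf+1, ones, N) */
  }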

My explanation for the Zen3 result is that we have one dependence
cycle per iteration that includes a load and an add, resulting in 6
cycles per iteration; we cannot know from this benchmark whether the
copying part is delayed or not, because the loop control does not
depend on it.  The Zen2 result is higher because even the i++
variant already takes almost 7 cycles.

So let's try something where the result of the copying is involved (move2):

  for (i=0; i<count; i+=incr[x])
    x = to[i] = from[i];

   d:   48 8b 14 c7             mov    (%rdi,%rax,8),%rdx
  11:   48 89 14 c6             mov    %rdx,(%rsi,%rax,8)
  15:   49 03 04 d0             add    (%r8,%rdx,8),%rax
  19:   48 39 c1                cmp    %rax,%rcx
  1c:   7f ef                   jg     d <move+0xd>

cycles/it
 1.43     Zen3
11.03     Zen2

Again, x=1, and incr[x]=1, too.  It seems that another optimization
is at work on Zen3 here: if the same memory address is loaded again
in short order, the result register is copied rather than the load
going through the memory unit.
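
One way to isolate this effect would be a pointer-chasing loop on a
single cell (a sketch of an additional test, not one of the
benchmarks above; I have not measured it):

  /* a dependence chain of loads, all from the same address; if the
     repeated load is turned into a register copy, this runs at copy
     latency rather than load-to-use latency */
  static long cell;

  long *chase(long count)
  {
    long i;
    long *p = &cell;
    cell = (long)&cell;          /* cell points to itself */
    for (i=0; i<count; i++)
      p = *(long **)p;           /* always loads &cell back */
    return p;                    /* keep the chain alive */
  }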

To work around this, another variant (move3):

  for (i=0; i<count; i+=incr[x+i])
    x = to[i] = from[i];

   d:   48 8b 14 c7             mov    (%rdi,%rax,8),%rdx
  11:   48 89 14 c6             mov    %rdx,(%rsi,%rax,8)
  15:   48 01 c2                add    %rax,%rdx
  18:   49 03 04 d0             add    (%r8,%rdx,8),%rax
  1c:   48 39 c1                cmp    %rax,%rcx
  1f:   7f ec                   jg     d <move+0xd>

where incr does not overlap from and to (and incr[x+i]=1).

cycles/it
 7.03     Zen3
12.04     Zen2

The 7 cycles on Zen3 are due to the latency of the load and the two
adds for computing i in every iteration.  The from[i] load depends on the
(obviously slow) incr[x+i] load, but does not seem to add any latency,
so it seems that the forwarding works.
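
One plausible accounting (assuming a 5-cycle load-to-use latency for
the indexed incr load; that number is an assumption, not measured
separately): 5+1+1=7 for move3, and the same assumption gives 5+1=6
for move1 above.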

It's interesting that on Zen2 move3 is faster than move1a; apparently
there is some extra latency involved on Zen2 when loading the result
of a recent store.

You can find the code for these benchmarks at
<http://www.complang.tuwien.ac.at/anton/pred-store-forward/> and build
and benchmark them with "make".
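
For a rough measurement without the Makefile, something like the
following works (a sketch; note that RDTSC counts reference cycles
rather than core clock cycles, so on a boosting machine the numbers
differ from performance-counter results):

  #include <stdio.h>
  #include <x86intrin.h>

  #define N 100000

  extern void move(long *from, long *to, long count);

  static long buf[N+1];

  int main(void)
  {
    unsigned long long start, stop;
    buf[0] = 1;
    start = __rdtsc();
    move(buf, buf+1, N);
    stop = __rdtsc();
    printf("%f cycles/it\n", (double)(stop-start)/N);
    return 0;
  }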

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>