This is a microbenchmark (with results) of gcc autovectorization, in
particular of the claim by gcc advocates that using MOVDQA instead of
MOVDQU provides a speedup significant enough to justify breaking
existing code.  For example, Jakub Jelinek claims in
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709#c10>:

|On most CPUs there is a significant performance difference between the
|two, even if you use MOVDQU with aligned addresses.

<http://lwn.net/Articles/690030/> is more specific, naming the K10 and
Core 2 as CPUs that benefit from breaking existing code.

[In the course of writing this, I came across
<https://software.intel.com/comment/1470256#comment-1470256>, where an
Intel performance guy writes that VMOVDQU vs. VMOVDQA is neutral for
performance.]

The microbenchmark itself is based on the code snippets in
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709>:

autovectors.c:
typedef unsigned long U64;

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
   *(U64*)dstPtr = *(U64*)srcPtr;
}

void bar(void *d, void *s, void *e) {
  do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
}

There are three versions of this (source file, resulting binary):

autovectors.c, movdqa: what gcc-4.9 -O3 generates.
autovectors1.s, movdqu: the output of gcc-4.9 -O3 -S autovectors.c,
      with MOVDQA replaced by MOVDQU.
standard.c, standard: follows the suggestion of gcc advocates and
      replaces the non-standard usage in autovectors.c with more
      standard-conforming code:

#include <string.h>

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
  memcpy(dstPtr,srcPtr,8);
}

Each of the binaries takes two digits as parameters that specify the
alignment: 0 means that the pointer is 8-byte-aligned, values 1-7
result in misalignment (the autovectorized code manages to get from
8-byte alignment to 16-byte alignment by itself).  The first digit is
for the s parameter of bar, the second for the d parameter.  E.g.:

movdqa 0 0

The microbenchmark copies 1800 bytes 1,000,000 times between the same
two buffers, so it runs completely inside the L1 D-cache.  This should
be pretty close to the best case for showing a speed difference
between MOVDQA and MOVDQU: No time spent in cache misses or processing
the data between loading and storing, and the proportion of time spent
on ramping up to and ramping down from the loop kernel should be
small.  Note that the autovectorizer uses MOVDQA only for loading from
memory, and uses MOVUPS (interestingly, not MOVAPS) to store the data.
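
The actual driver is in main.c (not reproduced here).  As an
illustration, a minimal driver along these lines could look as
follows; the buffer setup and the way the alignment digits are applied
are assumptions based on the description above, not necessarily what
main.c actually does:

/* hypothetical driver sketch; the real main.c may differ */
#include <stdlib.h>

void bar(void *d, void *s, void *e);  /* from autovectors.c or standard.c */

/* 8-byte aligned (element type unsigned long), >= 1800+7 bytes each */
static unsigned long srcbuf[226], dstbuf[226];

int main(int argc, char *argv[])
{
  long soff = (argc > 1) ? atol(argv[1]) : 0; /* first digit: s alignment */
  long doff = (argc > 2) ? atol(argv[2]) : 0; /* second digit: d alignment */
  char *s = (char *)srcbuf + soff;
  char *d = (char *)dstbuf + doff;
  long i;

  for (i = 0; i < 1000000; i++)     /* copy 1800 bytes 1,000,000 times */
    bar(d, s, d + 1800);
  return 0;
}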

You can run measurements with "make perf"; you should disable
hyperthreading for this.  There is also "make perfex", but you
probably don't have perfctr/perfex installed, and even if you do, the
event number used may not work on your CPU (I used it for the Core 2
results below).

Results in cycles, for parameters 0 0 (the M marks the millions digit,
i.e., 390M211868 means 390,211,868 cycles):
K10        Core 2     Sandy B.   Haswell    Skylake
390M211868 718M253607 257M454690 185M785903 188M783931 movdqa
390M284189 719M255845 257M487631 185M856962 188M884263 movdqu
475M590600 471M249291 471M456199 317M635572 326M291805 standard

Other results, not relevant for the question at hand, but still
interesting:

for 1 0 (load unaligned, store aligned):
K10        Core 2     Sandy B.   Haswell    Skylake
SIGSEGV    SIGSEGV    SIGSEGV    SIGSEGV    SIGSEGV   movdqa  
387M947295 947M297315 258M437034 186M345663 189M083883 movdqu  
476M917807 803M298891 471M460954 317M637046 326M618955 standard

for 0 1 (load aligned, store unaligned):
K10        Core 2      Sandy B.   Haswell    Skylake
392M462713 1551M331512 257M456897 202M449047 189M155890 movdqa  
391M207515 1659M342683 257M506148 201M601918 188M808331 movdqu  
476M800302  986M541600 471M466702 325M796613 324M676906 standard

for 1 1 (both unaligned):
K10        Core 2      Sandy B.   Haswell    Skylake
SIGSEGV    SIGSEGV     SIGSEGV    SIGSEGV    SIGSEGV    movdqa  
387M730013 2007M433159 258M442437 201M750963 189M109038 movdqu  
581M039547 1281M449651 471M465216 325M753548 325M758253 standard

The actual CPU models are:
K10: AMD Phenom(tm) II X2 560 Processor
Core 2: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
Sandy B.: Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
Haswell: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Skylake: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz


Discussion

The biggest speed difference between MOVDQU and MOVDQA on this
microbenchmark with aligned data (parameters 0 0) is a factor of
1.0014 (719M255845 vs. 718M253607 cycles), and it is seen on the Core
2; but on this CPU the best performance would come from not
vectorizing at all, so this is not a good argument for using MOVDQA.
In any case, there is no significant speed difference between the two
on aligned data, contrary to the claim by Jakub Jelinek.

So, whenever a GCC or LLVM maintainer or advocate makes a claim about
performance to justify breaking programs with undefined behaviour, and
(as usual) they don't support it with empirical data, I recommend not
believing the claim; tell them that you don't believe it and ask for
empirical support of the claim.

[I also played with a different micro-benchmark (the sum function in
the appendix), and found that gcc-4.9 generates MOVDQU for it (so I
did not use it as the microbenchmark for this test).  It looks to me
like the MOVDQA in the bug-reported case is just the result of a
seemingly-arbitrary decision, and much of what was written in response
to the bug report is just a rationalization justifying that decision.]

Now, concerning the advice given by some to make your program
standard-compliant, let's see how that affects performance: for the
"standard" program, the autovectorizer does not trigger, and we get
non-vectorized code instead of code with MOVDQU.  Apart from the Core
2 results, this code is significantly slower (by a factor of
1.21-1.83) than the movdqu code.  So the end result of the choices of
the gcc maintainers is that their much-touted autovectorization first
breaks previously-working code, putting work on the maintainers of
that code, and the suggested remedy then disables the feature.  If
they had chosen to use MOVDQU, the code would not need to be changed,
and would benefit from autovectorization.

You might argue that my minimally-invasive change to make the code
more standard-compliant is suboptimal for this microbenchmark, and you
would be right.  However, in the full-blown application it is probably
harder to see further optimization opportunities, and the maintainer
has other things on his mind at the time, so I think that a change
like the one I did would be a typical result in this situation.
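
As an illustration of such a further optimization (not benchmarked
here), the whole copy could be handed to memcpy in one call.  This
assumes that the two regions do not overlap and that copying exactly
e-d bytes, rather than rounding up to a multiple of 8, is acceptable;
both hold in this microbenchmark, but the LZ4 wild copy does not
guarantee either in general:

#include <string.h>

/* hypothetical variant, only valid if d and s do not overlap and if
   copying exactly e-d bytes is acceptable */
void bar(void *d, void *s, void *e)
{
  memcpy(d, s, (char *)e - (char *)d);
}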

Concerning the non-aligned results, it's surprising to me how little
the lack of alignment hurts 128-bit loads and stores (actually, no
significant slowdown on most of the CPUs measured here; the Core 2 is
the clear exception); I find this particularly surprising for stores.


Appendix:

void sum(int *a, int *b, int *c) {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}