Deadline 2026-01-06 23:59
All filenames refer to files on g0.complang.tuwien.ac.at. You should perform your measurements on this machine.
Following the Setup below will give you a directory 4 in a working
directory of your git project, containing Exercises4.txt
and other files.
Answer the questions by editing Exercises4.txt with a
text editor, inserting parts of the transcript of your shell session(s)
where needed. Don’t forget to commit and push the directory and the
relevant files before the deadline.
Please do not just answer the questions, but document how you arrived
at the answers (this is also useful for preparing yourself for the
interview part of the exam later in the semester). A good way to do that
for some questions is to provide transcripts of your shell sessions. You
can either cut-and-paste the text from the terminal session, or use
script -f <filename> to write the transcript to the
named file (and then copy parts of this transcript into your
Exercises4.txt).
cd into a working directory of your git project for
Efficient Programs on the course machine. Perform
unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/4.zip
git add 4
cd 4
make
The directory 4 contains Exercises4.txt
(fill in the answers there) and the files mentioned below and a
Makefile. Run make after changing a C file and
before measuring the binary built from the C file.
The goal of this exercise is to give you hands-on experience with the potential performance benefits of SIMD, and with the limits of auto-vectorization and manual vectorization.
There are source files *X*.c, *X*m.c that
make compiles to binaries *X*-*V*-*C*, where
*V* indicates the compilation options for
*X*.c:

*V*      flags
noauto   -Wall -O3 -march=native -mtune=znver4 -ffast-math -fno-tree-vectorize
auto     -Wall -O3 -march=native -mtune=znver4 -ffast-math
manual   -Wall -O3 -march=native -mtune=znver4 -ffast-math
The meaning of the flags:
-Wall                 warn about potential bugs.
-O3                   turn on lots of optimizations, including auto-vectorization.
-march=native         allow all instructions of the current machine; on the
                      course machine this includes AVX-512.
-mtune=znver4         tune for Zen4; in particular, actually use AVX-512
                      instructions (for some reason, gcc avoids AVX-512 with
                      -mtune=native).
-ffast-math           allow some transformations that may change the results;
                      in particular, reassociating FP + and *, even though
                      they are not associative.
-fno-tree-vectorize   disable auto-vectorization.
For the manual binaries, make compiles and links a source
file *X*-manual.c instead of *X*.c. This
source file uses the GNU C Vector
extensions (which are also supported by clang) to manually vectorize
all but the remaining <16 iterations of the code.
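To give an idea of what these extensions look like, here is a minimal
sketch of manually vectorizing an axpy-style loop (illustrative only;
the actual *X*-manual.c files may be structured differently):

    #include <stddef.h>

    /* 16 floats = 64 bytes = one AVX-512 register */
    typedef float vf16 __attribute__((vector_size(64)));

    void axpy(size_t n, float a, const float *x, float *restrict y)
    {
      size_t i;
      /* manually vectorized part: 16 elements per iteration */
      for (i = 0; i + 16 <= n; i += 16) {
        vf16 vx, vy;
        __builtin_memcpy(&vx, x + i, sizeof vx);  /* unaligned load */
        __builtin_memcpy(&vy, y + i, sizeof vy);
        vy += a * vx;          /* elementwise; the scalar a is broadcast */
        __builtin_memcpy(y + i, &vy, sizeof vy);  /* store */
      }
      /* scalar remainder: the <16 leftover iterations */
      for (; i < n; i++)
        y[i] += a * x[i];
    }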
*C* indicates the compiler used: gcc or
clang. These compilers produce different code with these
flags, and sometimes one produces faster code, sometimes the other.
To present measurement results from perf stat, I
recommend that you perform
export LC_NUMERIC=prog
before the measurements (or prepend
LC_NUMERIC=prog to your perf stat commands),
for more readable numbers.
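For example (assuming the dp1 binaries need no arguments):
LC_NUMERIC=prog perf stat ./dp1-noauto-gcc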
dp1-*V*-*C* performs ~2M data-parallel computations on
vectors of n=1..999 4-byte elements, with 2*n FP ops, for a total of ~2G
FP ops. The SIMD width of AVX-512 is 64 bytes (16 elements).
What is the speedup (measured in cycles) of the best
dp1-auto-*C* variant over the best
dp1-noauto-*C* variant? What is the speedup of the best
dp1-manual-*C* variant over the best
dp1-noauto-*C* variant? What is the number of FP ops per
second (also known as FLOPS) of the programs you used for the answers
above (if you assume that the number of FP ops is 2G, the answer is
close enough)?
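For example (with made-up numbers): if such a run takes 1G cycles on a
4 GHz machine, it runs for 0.25 s, so 2G FP ops / 0.25 s = 8 GFLOPS
(equivalently: 2 FP ops per cycle * 4 GHz = 8 GFLOPS).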
dp2.c and dp3.c are copies of
dp1.c (and likewise for dp[123]m.c). Change
dp2.c such that auto-vectorization does not work with gcc.
Change dp3.c such that auto-vectorization does not work
with clang. Ideally, these changes should be minimal. To prevent
too-trivial solutions, the resulting runs must still consume at least
150M cycles.
How do you determine whether auto-vectorization works? You could look at
the generated machine code, but that would require you to analyze this
code, and you can be misled by partial auto-vectorization (probably not
for this example, but in general). Here we use the criterion that, for
compiler *C*, dp[23]-auto-*C* is not more than
a factor 1.5 faster than dp[23]-noauto-*C*. Present the
changes (shown with diff -u dp1.c dp2.c, and likewise for
dp3.c) and the cycle numbers.
What change prevents auto-vectorization by gcc? What change prevents auto-vectorization by clang?
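For orientation, the classic pattern that defeats auto-vectorization is
a loop-carried dependence, where iteration i needs the result of
iteration i-1. A generic illustration (not necessarily a minimal or
valid change for dp2.c/dp3.c, which also has to respect the 150M-cycle
requirement):

    #include <stddef.h>

    /* prefix sum: x[i] depends on the just-computed x[i-1], so the
       iterations cannot simply be executed 16 at a time */
    void prefix_sum(size_t n, float *x)
    {
      for (size_t i = 1; i < n; i++)
        x[i] += x[i-1];
    }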
dp4.c, dp4m.c, and
dp4-manual.c are copies of the corresponding
dp1 files. Change dp4.c in a way that you can
still manually vectorize (i.e., where you can change
dp4-manual.c in a way that produces the same results), but
where auto-vectorization of gcc does not work (using the speedup
criterion from Q2 for determining the success and failure of
auto-vectorization and manual vectorization). You can introduce
additional parameters and change dp4m.c correspondingly
(including checking the outputs of the function in dp4.c).
I found it helpful to first change dp4.c and examine the
compiler output with
make dp4-auto-gcc.o && objdump -d dp4-auto-gcc.o
before making the corresponding changes to dp4-manual.c
and dp4m.c.
If you like a challenge, declare all the pointer/array parameters
that the function in dp4.c writes to as
restrict (like the original function does for
y).
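For reference, restrict on written-to parameters looks like this (a
hypothetical signature, not the actual function from dp4.c):

    #include <stddef.h>

    /* restrict promises that y and z alias neither x nor each other,
       so the compiler may reorder/vectorize without overlap checks */
    void f(size_t n, const float *x, float *restrict y, float *restrict z)
    {
      for (size_t i = 0; i < n; i++) {
        y[i] = 2.0f * x[i];
        z[i] = x[i] + 1.0f;
      }
    }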
Please show a loop (extracted from dp4.c)
that is vectorizable (with the equivalent manually vectorized code from
dp4-manual.c), but that gcc does not auto-vectorize.
Include the measurements that indicate the non-vectorization of
dp4-auto-gcc and the vectorization of
dp4-manual-gcc.
Note: The corresponding task for clang is not included, because clang
auto-vectorizes simple data-parallel loops even with a large number of
written-to arrays that are not declared restrict (checked by
looking at the machine code, not by measuring the performance). However,
I have seen clang’s auto-vectorizer fail at other vectorizable code.
Unfortunately, that code cannot be expressed manually with the GNU C
vector extensions.
red1-*V*-*C* performs ~1M reductions (input: vectors,
output: scalars) on vectors of n=1..1000 4-byte elements, with 2*n FP
ops, for a total of ~1G FP ops.
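The core of such a reduction is presumably a loop of the following
shape (a dot-product sketch; the actual code in red1.c may differ):

    #include <stddef.h>

    float dot(size_t n, const float *x, const float *y)
    {
      float r = 0.0f;
      for (size_t i = 0; i < n; i++)
        r += x[i] * y[i];   /* 2 FP ops per element, all chained through r */
      return r;
    }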
What is the speedup (measured in cycles) of the best
red1-auto-*C* variant over the best
red1-noauto-*C* variant? What is the number of FP ops per
second (also known as FLOPS) of the programs you used for the answers
above (if you assume that the number of FP ops is 1G, the answer is
close enough)?
Bonus question (food for thought, not as deliverable): Why are the FP ops per cycle for the noauto variants worse than in dp1.c? Why do you not see the same effect for the auto variants?
gen1-*V*-*C* performs ~1M generations (input: scalars,
output: vectors) of vectors of n=1..1000 4-byte elements.
What is the speedup (measured in cycles) of the best
gen1-auto-*C* variant over the best
gen1-noauto-*C* variant?
cond1-*V*-*C* performs ~2M loops, each of which
conditionally performs 2 FP ops per iteration on 1..1000 4-byte elements
of the associated arrays.
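Such a loop presumably has the following shape (illustrative; the
actual code in cond1.c may differ):

    #include <stddef.h>

    void cond(size_t n, float a, const float *x, float *restrict y)
    {
      for (size_t i = 0; i < n; i++)
        if (x[i] > 0.0f)      /* vectorizers turn this branch into a mask */
          y[i] += a * x[i];   /* 2 FP ops, executed conditionally */
    }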
What is the speedup (measured in cycles) of the best
cond1-auto-*C* variant over the best
cond1-noauto-*C* variant?
Bonus question (food for thought, not as deliverable): Why are these
programs slower than the dp1 programs, even though they
perform fewer useful FP ops? In particular, why are the
noauto programs so much slower?
dp5-*V*-*C* takes an argument n and performs 2M
data-parallel loops over vectors with m=1..2*n-1
elements (average n); each loop performs 4m FP ops, 3m loads, and 3m stores.
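To collect the measurements over a range of n, a shell loop like the
following may help (assuming n is the sole command-line argument):
for n in $(seq 1 32); do LC_NUMERIC=prog perf stat ./dp5-auto-gcc $n; done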
Which is the first n for which dp5-auto-*C* is
a factor >=1.05 faster than the fastest dp5-noauto-*C*
(speed measured in cycles)? Only show the measurements for this
n in Exercises4.txt, and the others in a
transcript that you also include.
Which is the last n for which
dp5-auto-*C* is a factor <1.05 faster than the fastest
dp5-noauto-*C* (speed measured in cycles)? Only show the
measurements for this n in Exercises4.txt,
and the others in a transcript that you also include.
Is there an n for which
dp5-noauto-*C* is a factor >=1.05 faster than the fastest
dp5-auto-*C* (speed measured in cycles)? If so, present the
measurements for that n.
dp6-*V*-*C* performs the same computation as
dp5-*V*-*C*, but the written-to pointers in
axpy1() are not defined as restrict.
For n=3, what is the speedup (measured in cycles) of the
best dp5-noauto-*C* over the best
dp6-noauto-*C*? For n=3, what is the speedup of
the best dp5-auto-*C* over the best
dp6-auto-*C*?
Which is the first n for which
dp6-auto-*C* is a factor >=1.05 faster than the fastest
dp6-noauto-*C* (speed measured in cycles)? Which is the
last n for which dp6-auto-*C* is a factor
<1.05 faster than the fastest dp6-noauto-*C*
(speed measured in cycles)? Which is the last n for which
dp6-noauto-*C* is a factor >=1.05 faster than the fastest
dp6-auto-*C* (speed measured in cycles)?
Note: If you define a pointer as restrict (i.e., promise that
all memory accessed through this pointer is accessed only through this
pointer), your program actually must not access the same memory in
another way. The compiler may reorder the memory accesses through this
pointer relative to other memory accesses based on this promise, and if
those other accesses touch the same memory, this reordering may lead to
incorrect results.
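A hypothetical illustration of such a violation:

    #include <stddef.h>

    void add1(size_t n, const float *x, float *restrict y)
    {
      for (size_t i = 0; i < n; i++)
        y[i] = x[i] + 1.0f;  /* may be vectorized: 16 loads before 16 stores */
    }

    void caller(float *a, size_t n)
    {
      /* Violation: memory written through y (a+1) is also read through
         x (a).  Scalar execution propagates values element by element;
         a vectorized version reads 16 old values at once, producing
         different results. */
      add1(n - 1, a, a + 1);
    }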