Exercises 4

Deadline 2026-01-06 23:59

All filenames refer to files on g0.complang.tuwien.ac.at. You should perform your measurements on this machine.

Submission

Following the Setup below will give you a directory 4 in a working directory of your git project, containing Exercises4.txt and other files.

Answer the questions by editing Exercises4.txt with a text editor, inserting parts of the transcript of your shell session(s) where needed. Don’t forget to commit and push the directory and the relevant files before the deadline.

Please do not just answer the questions, but document how you arrived at the answers (this is also useful for preparing yourself for the interview part of the exam later in the semester). A good way to do that for some questions is to provide transcripts of your shell sessions. You can either cut-and-paste the text from the terminal session, or use script -f <filename> to write the transcript to the named file (and then copy parts of this transcript into your Exercises4.txt).
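For example, to record a session transcript with script (the transcript filename here is chosen arbitrarily):

script -f q1-transcript.txt
... your measurement commands ...
exit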

Setup

cd into a working directory of your git project for Efficient Programs on the course machine. Perform

unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/4.zip
git add 4
cd 4
make

The directory 4 contains Exercises4.txt (fill in the answers there), the files mentioned below, and a Makefile. Run make after changing a C file and before measuring the binary built from the C file.

The goal of this exercise is to give you hands-on experience with the potential performance benefits of SIMD, and with the limits of auto-vectorization and manual vectorization.

There are source files *X*.c and *X*m.c that make compiles to binaries *X*-*V*-*C*, where *V* indicates the compilation options for *X*.c:

*V*     flags
noauto  -Wall -O3 -march=native -mtune=znver4 -ffast-math -fno-tree-vectorize
auto    -Wall -O3 -march=native -mtune=znver4 -ffast-math
manual  -Wall -O3 -march=native -mtune=znver4 -ffast-math

The meaning of the flags is:

-Wall  warn about potential bugs.
-O3    turn on lots of optimizations, including auto-vectorization.
-march=native allow all instructions available on the current machine;
              on the course machine that includes AVX-512.
-mtune=znver4 Tune for a Zen4 (actually use AVX-512 instructions).
              For some reason, gcc avoids AVX-512 with -mtune=native.
-ffast-math   allow some transformations that may change the results.
              In particular, it allows reassociating + and *,
              even though FP + and * are not associative.
-fno-tree-vectorize  Disable auto-vectorization.

For the manual variants, make actually compiles and links a source file *X*-manual.c instead of *X*.c. This source file uses the GNU C vector extensions (which are also supported by clang) to manually vectorize all but <16 iterations of the code.
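As an illustration of this technique (a minimal sketch, not the actual contents of any *X*-manual.c file), an axpy-style loop could be manually vectorized like this:

#include <string.h>

typedef float v16sf __attribute__ ((vector_size (64))); /* 16 floats = 64 bytes */

/* y[i] += a*x[i], with all but the last <16 iterations vectorized */
void axpy(long n, float a, const float *x, float *restrict y)
{
  long i = 0;
  for (; i + 16 <= n; i += 16) {
    v16sf vx, vy;
    memcpy(&vx, x + i, sizeof vx); /* load without alignment requirement */
    memcpy(&vy, y + i, sizeof vy);
    vy += a * vx;                  /* the scalar a is broadcast to all 16 lanes */
    memcpy(y + i, &vy, sizeof vy);
  }
  for (; i < n; i++)               /* scalar loop for the remaining <16 iterations */
    y[i] += a * x[i];
}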

*C* indicates the compiler used: gcc or clang. These compilers produce different code with these flags, and sometimes one produces faster code, sometimes the other.

To present measurement results from perf stat, I recommend that you perform

export LC_NUMERIC=prog

before performing the measurements (or prepend LC_NUMERIC=prog to your perf stat commands) to get more readable numbers.
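For example (dp1-auto-gcc used here as an example binary):

LC_NUMERIC=prog perf stat ./dp1-auto-gcc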

Q1 Speedup from vectorizing a data-parallel loop

dp1-*V*-*C* performs ~2M data-parallel computations on vectors of n=1..999 4-byte elements, with 2*n FP ops, for a total of ~2G FP ops. The SIMD width of AVX-512 is 64 bytes (16 elements).

What is the speedup (measured in cycles) of the best dp1-auto-*C* variant over the best dp1-noauto-*C* variant? What is the speedup of the best dp1-manual-*C* variant over the best dp1-noauto-*C* variant? What is the number of FP ops per second (also known as FLOPS) of the programs you used for the answers above (if you assume that the number of FP ops is 2G, the answer is close enough)?
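As a worked example with made-up numbers: if the best noauto variant takes 6.0G cycles and the best auto variant takes 1.5G cycles, the speedup is 6.0G/1.5G = 4. For the FLOPS, divide the FP ops by the run time in seconds: a variant that performs ~2G FP ops in 0.4s achieves 2G FP ops / 0.4s = 5 GFLOPS.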

Q2 What obstructs vectorization?

dp2.c and dp3.c are copies of dp1.c (and likewise for dp[123]m.c). Change dp2.c such that auto-vectorization does not work with gcc. Change dp3.c such that auto-vectorization does not work with clang. Ideally, these changes should be minimal. To prevent too-trivial solutions, the resulting programs should consume at least 150M cycles. How do you determine whether auto-vectorization works? You could look at the generated machine code, but that would require you to analyze that code, and you could be misled by partial auto-vectorization (probably not for this example, but in general). Here we use the criterion that, for compiler *C*, dp[23]-auto-*C* is not more than a factor 1.5 faster than dp[23]-noauto-*C*. Present the changes (shown with diff -u dp1.c dp2.c and likewise for dp3.c) and the cycle numbers.

What change prevents auto-vectorization by gcc? What change prevents auto-vectorization by clang?
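As a generic illustration (not a solution to this question, because it obstructs both compilers), a loop-carried dependency typically prevents auto-vectorization, because each iteration needs the result of the previous one:

/* sketch: a recurrence that auto-vectorizers typically cannot handle */
void scan(long n, float a, const float *x, float *restrict y)
{
  for (long i = 1; i < n; i++)
    y[i] = y[i-1] + a * x[i];
}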

Q3 When does gcc auto-vectorization fail?

dp4.c, dp4m.c, and dp4-manual.c are copies of the corresponding dp1 files. Change dp4.c in a way that can still be manually vectorized (i.e., you can change dp4-manual.c so that it produces the same results), but that gcc does not auto-vectorize (use the speedup criterion from Q2 to determine the success or failure of auto-vectorization and manual vectorization). You can introduce additional parameters and change dp4m.c correspondingly (including checking the outputs of the function in dp4.c). I found it helpful to first change dp4.c and examine the compiler output with

make dp4-auto-gcc.o && objdump -d dp4-auto-gcc.o

before making the corresponding changes to dp4-manual.c and dp4m.c.
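When reading the objdump output, a quick heuristic is to look for 512-bit (zmm) registers, e.g.:

make dp4-auto-gcc.o && objdump -d dp4-auto-gcc.o | grep zmm

If the inner loop consists only of scalar instructions (e.g., vaddss/vmulss on xmm registers), gcc has not vectorized it.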

If you like a challenge, declare all the pointer/array parameters that the function in dp4.c writes to as restrict (like the original function does for y).

Please show a loop (extracted from dp4.c) that is vectorizable (i.e., there is equivalent manually vectorized code in dp4-manual.c), but that gcc does not auto-vectorize. Include the measurements that indicate the non-vectorization of dp4-auto-gcc and the vectorization of dp4-manual-gcc.

Note: The corresponding task for clang is not included, because for simple data-parallel loops clang auto-vectorizes them even with a large number of non-restricted written-to arrays (checked by looking at the machine code, not by measuring the performance). However, I have seen clang’s auto-vectorizer fail at other vectorizable code. Unfortunately, that code cannot be expressed manually with the GNU C vector extensions.

Q4 Speedup for a reduction

red1-*V*-*C* performs ~1M reductions (input: vectors, output: scalars) on vectors of n=1..1000 4-byte elements, with 2*n FP ops, for a total of ~1G FP ops.

What is the speedup (measured in cycles) of the best red1-auto-*C* variant over the best red1-noauto-*C* variant? What is the number of FP ops per second (also known as FLOPS) of the programs you used for the answers above (if you assume that the number of FP ops is 1G, the answer is close enough)?

Bonus question (food for thought, not as deliverable): Why are the FP ops per cycle for the noauto variants worse than in dp1.c? Why do you not see the same effect for the auto variants?
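For thinking about the bonus question, this sketch of a scalar reduction (red1.c's actual computation may differ) shows the relevant property: each addition depends on the result of the previous one, so the loop can proceed at best at one iteration per FP-add latency. -ffast-math allows the compiler to reassociate the sum, e.g., into independent partial sums.

/* sketch of a reduction: s carries a loop-carried dependency */
float dot(long n, const float *x, const float *y)
{
  float s = 0.0f;
  for (long i = 0; i < n; i++)
    s += x[i] * y[i];
  return s;
}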

Q5 Speedup for a generation

gen1-*V*-*C* performs ~1M generations (input: scalars, output: vectors) of vectors of n=1..1000 4-byte elements.

What is the speedup (measured in cycles) of the best gen1-auto-*C* variant over the best gen1-noauto-*C* variant?
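The general shape of such a loop (purely illustrative; gen1.c's actual computation may differ):

/* sketch of a generation: scalar inputs, vector (array) output */
void fill(long n, float a, float *restrict y)
{
  for (long i = 0; i < n; i++)
    y[i] = a;
}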

Q6 Speedup for data-parallel loop with conditional execution

cond1-*V*-*C* performs ~2M loops, each of which conditionally performs 2 FP ops per iteration on 1..1000 4-byte elements of the associated arrays.

What is the speedup (measured in cycles) of the best cond1-auto-*C* variant over the best cond1-noauto-*C* variant?

Bonus question (food for thought, not as deliverable): Why are these programs slower than the dp1 programs, even though they perform fewer useful FP ops? In particular, why are the noauto programs so much slower?
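A sketch of the kind of loop involved (hypothetical; cond1.c's actual code may differ):

/* sketch: 2 FP ops performed only for elements that pass the test */
void condaxpy(long n, float a, const float *x, float *restrict y)
{
  for (long i = 0; i < n; i++)
    if (x[i] > 0.0f)
      y[i] = y[i] + a * x[i];
}

Scalar code executes a data-dependent branch per element, whereas vectorized code can compare a whole SIMD vector at once and use the result as a mask (AVX-512 has dedicated mask registers for this), computing all lanes and storing only the selected ones.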

Q7 Vectorization overhead

dp5-*V*-*C* takes an argument n and performs 2M loops of a data-parallel computation on vectors of m=1..2*n-1 elements (average n) (4m FP ops, 3m loads and 3m stores).
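For example, for n=8 (the value is arbitrary):

perf stat ./dp5-auto-gcc 8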

Which is the first n for which dp5-auto-*C* is a factor >=1.05 faster than the fastest dp5-noauto-*C* (speed measured in cycles)? Only show the measurements for this n in Exercises4.txt, and the others in a transcript that you also include.

Which is the last n for which dp5-auto-*C* is a factor <1.05 faster than the fastest dp5-noauto-*C* (speed measured in cycles)? Only show the measurements for this n in Exercises4.txt, and the others in a transcript that you also include.

Is there an n for which dp5-noauto-*C* is a factor >=1.05 faster than the fastest dp5-auto-*C* (speed measured in cycles)? If so, present the measurements for that n.

Q8 Effect of restrict

dp6-*V*-*C* performs the same computation as dp5-*V*-*C*, but the written-to pointers in axpy1() are not defined as restrict.

For n=3, what is the speedup (measured in cycles) of the best dp5-noauto-*C* over the best dp6-noauto-*C*? For n=3, what is the speedup of the best dp5-auto-*C* over the best dp6-auto-*C*?

Which is the first n for which dp6-auto-*C* is a factor >=1.05 faster than the fastest dp6-noauto-*C* (speed measured in cycles)? Which is the last n for which dp6-auto-*C* is a factor <1.05 faster than the fastest dp6-noauto-*C* (speed measured in cycles)? Which is the last n for which dp6-noauto-*C* is a factor >=1.05 faster than the fastest dp6-auto-*C* (speed measured in cycles)?

Note: If you define a pointer as restrict (i.e., promise that all memory accessed through this pointer is accessed only through this pointer), your program actually must not access the same memory in another way. The compiler may reorder the memory accesses through this pointer relative to other memory accesses based on this promise, and if those other accesses touch the same memory, this reordering may lead to incorrect results.
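A minimal sketch of such a misuse (hypothetical signature; the actual axpy1() may differ):

void axpy1(long n, float a, const float *x, float *restrict y)
{
  for (long i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

float v[1000];
axpy1(999, 2.0f, v, v + 1); /* violates the restrict promise: x and y overlap */

Because of the restrict promise, the compiler may load a whole SIMD vector of x before storing any element of y; with the overlapping call this can produce different results than executing the loop element by element.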