Due Friday, November 20, 1998 at 3:00 PM
In this lab you'll do performance analysis and improvement of a matrix multiply that is too big to fit into some or all levels of cache memory.
To get started, create a working directory and make a copy of the working files for this homework. For example:
cp /afs/ece.cmu.edu/class/ece548/www/hw/lab5/* .
You will then have the following files to work with:
You are strongly urged to get an early start on this lab. You may want to set up a command file for the "at" facility and run your simulations overnight, testing them first to make sure they're likely to "go". Also, think before you run -- it's easy to burn up a lot of CPU hours unnecessarily, depriving your classmates of machine cycles. In particular, it is neither necessary nor desirable to run multi-level cache simulations for any problem except problem 1.
Problem 1: Collect Three-Level Dinero Data
This problem demonstrates how to use Atom and Dinero to collect multi-level cache data.
1) The "prob1" file provided in the homework directory will produce 3-level cache data for a 100x100 matrix multiply. Both an uninstrumented program and an instrumented program are executed. Note that the instrumented program reports a much longer elapsed time because at least some of the time taken by the instrumentation is also charged against the program. Execute this file and report:
2) Report and explain the data write miss ratios for both the L1 and L2 caches (i.e., what is going on with the code to cause each of them to be either large or small, depending on what you find?).
Problem 2: Diagnose a Performance Problem
1) Measure execution times for N=250 through N=262 for the uninstrumented matmul.c program compiled with optimization (see the file prob2). Plot the execution times on an appropriate graph. Use the same compiler optimizations as in problem 1, but change the definition of N, e.g. "DN=260" would set N to the value 260 for that compilation. Do not do dinero simulations -- just run the programs uninstrumented.
2) Find out what is going on in the cache at N=256. Run a dinero simulation for L1 cache only at N=256 using the same compilation and simulation parameters as used in previous runs. Explain what the results mean about the program behavior in general terms. Be sure to use the -z option to get a run that isn't stopped early. Important -- this is a single-level cache simulation only -- don't run the 3-level cache script for this problem or any of the remaining problems -- it will take a lot of CPU time for no reason! If you haven't used them before, ask for help using the "at" or "nohup" Unix commands so you don't have to run this simulation interactively.
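If "at" and "nohup" are new to you, the sketch below shows typical usage. The script name ./prob2 and the 2 AM start time are placeholders -- substitute whatever you actually need to run:

```shell
# Run the job in the background so it keeps going after you log out;
# all output is captured in a log file you can read the next morning.
nohup ./prob2 > prob2.log 2>&1 &

# Or schedule it to start at 2 AM with "at". Job output is mailed to
# you unless redirected, so redirect it here too.
echo "cd $PWD && ./prob2 > prob2.log 2>&1" | at 2am
```

Either way the simulation runs unattended instead of tying up your terminal (and your evening).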
Problem 3: Use a transpose matrix
Modify the subroutine my_matmul in matmul.c to use a transposed b matrix by adapting code from examples.c (take code from the routine transpose()).
1) Measure the uninstrumented run time for the transposed algorithm at N=256. It should be faster than the original matrix multiply, even though it does the extra work of copying the b array before doing the actual multiply. List the execution times for both the original and the transposed matrix code. Include a listing of your new my_matmul routine with the transpose algorithm.
2) Why is the transpose routine so much faster? Make a hypothesis, and test that hypothesis at N=251 so that the anomaly at N=256 isn't a factor. You don't have to have the correct answer to get full credit -- just give a plausible answer and a reasonable test with results that either prove it or disprove it. It's OK to use cache simulation parameters from the servers even if you actually measured results on an Alpha workstation -- the on-chip caches are not very different. If you're having trouble coming up with a hypothesis, send the course staff e-mail for a detailed hint.
The point of Problem 3 is to take one trip through the scientific-method cycle:
- Do a timed run of both versions at N=251 and note the speed difference
- Perhaps do timed runs at other N values if you like
- Look at simulation results from previous parts of the homework
- Take a plausible guess at what might be going on here based on your experience, observations, and a look at the source code
- Formulate some experiment that you can do to test your hypothesis and run it. (Hint: the chances that something going on in the L3 cache causes that big a performance difference are low, so I wouldn't bother with a 3-level cache simulation given how long it takes to run; but an L1-only cache simulation might possibly be in order...)
- Did the experiment work? What did you learn?
- If your experiment didn't "work" that's OK too. Just say what you were trying to find and what you ruled out. The point is to experience the above process, not to necessarily solve the entire problem if that becomes a time sink.
Problem 4: Diagnose a Performance Problem with matrix multiply
In this problem you'll be characterizing and optimizing the performance of a large matrix multiply which doesn't fit into cache. In particular, you must make a 1024x1024 long-int matrix multiply go fast.
Starting with the original program matmul.c, and using the examples.c program for ideas, create a performance-optimized matrix multiply routine for the Alphaservers. The objective is to create a matrix multiplication program that performs well for matrices too large to fit into the L3 cache.
The rules are:
Grading will be based on:
1) Keep a log of every optimization you try with a one-line description and execution time (it would be no surprise if there were dozens of entries in this). You will not be graded on how many tries it took; anyone with a believable and readable log will get full credit for this sub-problem.
2) Give a listing of your final optimized my_matmul() routine.
3) Give a listing of how you invoked the compiler (hint: you might want to compute OFFSET and FILLER based on the value of N). This might be a command line or a make file depending on how fancy you get.
4) What was the execution time, the speedup you achieved, and the machine load (using uptime) when you measured the speedup? If the machine is not completely idle you will probably want to average a few trials. If you get close to the required execution time but can't get faster than that, consult with the course staff to determine whether the problem is heavy machine load from other students or something in your code. As a hint: if you're not using a block decomposition strategy of some kind, you aren't there yet.
5) Draw a picture showing the access pattern of the arrays in your optimized code. Label arrows with the names of all your loop variables.
We realize this is a rather open-ended assignment. We're willing to enter a dialogue via e-mail and newsgroup if you feel you don't have sufficient direction. The assignment is supposed to be enlightening, not frustrating, so let us know if there are ways to improve the experience. There are probably several ways to do problem 4 with very little work (and, once again, if you didn't literally copy the code, it isn't cheating). In general the benefit of this lab is proportional to the thought (not necessarily hours) you put into it.
You might want to experiment with optimizations on smaller arrays (say, 128x128 arrays) to understand the algorithms. This will reduce the time to run an experiment until you are zeroing in on the final answer.
There is a "feature" in Digital Unix which can cause the unoptimized program to run much faster than what is stated above, but leaves the optimized program unaffected -- if you encounter this "feature" we can either reboot the machine you are using or tell you how to do the reboot. (Isn't dealing with real hardware and software more fun than just getting simulation results?) At any rate, the thing that really counts is getting optimized performance below the targets, and the targets are unaffected by this problem.
A student in a previous semester asked whether getting this speedup was realistic, since it is very close to the theoretical limit of no memory accesses at all. The short answer is yes, it's realistic; that's the point of being clever about exploiting the memory hierarchy. Below are actual speedups measured on the course machines. Note that the faster machine gets a greater speedup from improving cache performance because main memory speed is the same on both machines: a constant-duration L3 cache miss penalty costs more clocks at 500 MHz than at 433 MHz (for example, a fixed 100 ns penalty is 50 clocks at 500 MHz but only about 43 clocks at 433 MHz).
    Machine speed:                         433 MHz                  500 MHz
    Program                          Exec. time   Speedup    Exec. time   Speedup
    -----------------------------------------------------------------------------
    Original matmul                  305.00 s       --        289.28 s      --
    Improved matmul (load = 1)        31.48 s      9.69        25.82 s     11.2
    Improved matmul (load = 4)        33.12 s *    9.21        27.17 s *   10.65

    * average of several trials
You will not be penalized heavily (and in many cases will get full credit) if you make a good try and don't quite hit the speedup target as long as you provide all the information for the program you developed as requested in the five problem parts assigned above.
18-548/15-548 home page.