18-548/15-548 Fall 1998

Lab 4:
Multi-Level Cache &
Context Switching Effects

Due Friday, October 23, 1998 at 3:00 PM

Draft until Friday 10/9

This is a simulation homework that takes a look at the benefits of multi-level caching and an example of the effects of context switching on cache performance. You will also see an example of how to instrument a program using Atom, as well as running a multi-level Dinero simulation. Both problems require use of an Alpha workstation.

The homework-specific files you will need are in /afs/ece/class/ece548/www/hw/lab4 . There are subdirectories with example scripts for both problems, and you are welcome to make use of them. In order to run correctly, the scripts must be copied to your own directory, and must have a logical link to the lab4 directory
ln -s /afs/ece/class/ece548/www/hw/lab4
in the same directory from which you run the script. Note that the scripts are representative, and may not necessarily do everything required to answer the lab questions.

Problem 1: Multi-Level Caching

In this problem you'll analyze the performance of a program with respect to various single- and multi-level cache configurations. The program you'll be instrumenting is madd.c which is a simple matrix addition program. Please note that the point of this program is to have something that is relatively simple to understand but still has interesting behavior -- so while it is possible to optimize performance of this program by loop rearrangement and eliminating the second call to madd(), that is not the point of this exercise. (Your chance to optimize performance on "real" code comes in the next lab...) All terminology is Cragon-based (sectors+blocks) rather than dinero-based. Atom is used to instrument only the procedure madd(), so you don't need to worry about the caching effects of any other part of the program.

1. Compile and atomize madd.c and report dinero simulation results for the three cache configurations below. For each case show the dinero command line, data read miss rate, data write miss rate, and total bus usage. The first case is a straightforward write-through cache with sector size=block size of 4 words. The second case is a write-back cache having otherwise similar parameters. The third case is an L2-like unified cache.

  1. dinero -i8K -d8K -a4 -ww -An -b32 -W8 -B16
  2. dinero -i8K -d8K -a4 -wc -Aw -b32 -W8 -B16
  3. dinero -u512K -a4 -wc -Aw -b32 -W8 -B16

2. For the results in part 1 above, answer the following questions.

  1. Why is the write miss ratio of the 8K write-through cache essentially 100%?
  2. Why is the read miss ratio of the 8K write-back cache essentially 100% and the bus usage so much higher?
  3. Why does the 512K cache have better performance than the small write-back cache?
  4. What you'd really like for the 512K cache is to be able to use a block size of 8 bytes on a 32-byte-sector, and still have the cache fetch an entire sector on read miss (prefetching the entire sector on miss), but only make valid+dirty a single block on a write miss while leaving all other blocks within a sector invalid on a write miss. This scheme would suppress reads needed to fill out the block size of greater than 8 bytes on a write miss. How would you expect the simulation results from the 512K cache simulation to change if dinero could simulate this arrangement (which it can't for implementation reasons) -- specifically what would the new number of bus cycles would be for this "smart sectoring" approach? (Hints: if you want to understand the write behavior better try a dinero simluation changed to an 8-byte block size and think about the results; but realize that this isn't exactly what this problem is asking. This problem requires you to look at your dinero output and see how it would differ if "smart sectoring" were implemented.)
  5. Compute the theoretical minimum number of bus cycles if you consider only accesses to the data arrays. How much overhead (in bus cycles) is being spent on subroutine calling overhead (compare the theoretical result to the "smart sectoring" result for the 512K cache.).

3. Use the following simplified timing model and compute the effective memory access time for data accesses on the three examples in part 1 of this problem, showing them in a table. (We're going to ignore instruction memory for the purposes of illustrating a point). Which option is the fastest? This timing model has specifically been simplified to ignore any data dependencies or other fine-grain timing effects; instead it is more of a "plumbing" model dealing with gross data flow rates. This problem asks you to consider 2 ways in which you might be limited in each of the three configurations: memory access latency or bus bandwidth. Either effective memory access time (as in a total for the whole program) or average effective memory access time are fine.

4. Model a two-level cache with the parameters below. Note that the L1 cache can now have a 32-byte bus instead of the 16-byte bus used in earlier 8K cache examples because it is connected to an L2 cache instead of to a memory bus. Report the dinero command line, data read miss rate, data write miss rate, and total bus usage for each cache as you did for part 1 of this problem. How much better (in clocks, not as a ratio) is this composite 2-level cache performance than the best case from part 3 of this problem? Use the simplified timing model below.

L1 cache: dinero -i8K -d8K -a4 -ww -An -b32 -W8 -B32
L2 cache: dinero -u512K -a4 -wc -Aw -b32 -W8 -B16

Problem 2: Context Switching

In this problem you'll simulate what happens when multiple programs alternate in accessing cache memory. Although this problem does not include measurements of direct context switching overhead (e.g., register saves and restores), it does give a feel for the indirect context switching costs caused by inter-task cache conflict misses. This problem assumes that you have a logical link in place to the lab4 subdirectory as in problem 1, and uses three traces in the lab4/traces directory. There is nothing special about these traces -- they were arbitrarily selected from the H&P trace distributions.

1. Run the program mtask.c while varying the tasking frequency and cache size, then feed the outputs to dinero. This takes a number of accesses and a list of files as input parameters. It then reads the specified number of memory access from the first file, then from the second file, and so on until wrapping around back to the first file and prints them to standard output. This gives a simple approximation to a multitasking environment. (mtask also alters the high bits of each address to reduce the possibility that different programs will accidentally get cache hits on memory locations belonging to other programs; virtual memory effects are ignored.) Report your results in a table for the following experimental conditions:

2. Plot a graph with the results from part 1 having four curves -- one per cache size. Put miss ratio on a logarithmic Y axis and number of accesses between switches on a logarithmic X axis. Be sure to label the curves and the axes.

3. For the data you have plotted there will be "break points" in terms of miss ratio improving noticably for any given cache size when the number of memory accesses between switches changes above a certain point. Look at your graph and state an approximate relationship for how big cache size should be for this situation in order to be to the right of the break point on the performance curve. Detailed numerical analysis is not required -- just an approximate relationship with one digit of precision. Of course it should be understood that this isn't a general rule, just a result for this particular situation.

18-548/15-548 home page.