# Lab 1: Cache Size

Due Friday, September 4, 1998

Problem 1.
Exercise the Dinero cache simulation tool with some pre-generated address traces. The traces can be found at:
/afs/ece.cmu.edu/class/ece548/www/hw/lab1
There are three traces as follows. Because these traces are a bit large, you may want to use them from where they are by logically linking to them rather than copying them to your local directory.

File Name Size(bytes)
a.din 14417700
b.din 14417700
c.din 14417700

The traces capture data accesses made from a program run (instruction accesses were ignored, as were any memory accesses from the main() procedure). They are already in a format suitable for being fed directly into dinero and contain 32-bit trace information (so they can be correctly processed by Dinero running on any 32-bit Unix platform). The three traces are data accesses performed by the three toy benchmark programs (but, at this point we're not telling you which trace is for which program):

Note: for any short-answer questions, it shouldn't take more than about 50 words to answer the question; long essays are neither required nor desired at any time in this course unless specifically asked for.

1. For each trace, run Dinero using the following parameters for a variety of cache sizes:
• "Block" size (really "sector size" in our terminology) 4 bytes
• Unified cache (doesn't really matter, since we're only looking at data accesses)
• All other parameters to default
Perform simulation runs for cache sizes varying by a factor of two: 256 bytes, 512 bytes, 1K bytes, and so on up to and including 1M byte (1048576 bytes). List the miss rates for each run in a table by cache size and trace name.
2. Take the Dinero results and plot all three curves in a single graph. Place miss rate on a linear Y axis and cache size on a logarithmic X axis. (Note: freehand-sketched graphs are not acceptable; either use graph paper or use graphing software).
3. Which trace goes with which program? Back up each pairing with a short reason. The answers should be based on reasoning about program behavior with respect to cache memory. Basing your answer on a simulation result that you perform yourself is not acceptable; neither is justifying the third pairing by process of elimination.
4. For the given parameters for each trace, what is the minimum size to make the cache to capture essentially all of the locality and achieve a low miss rate? (Provide three distinct answers, one for each trace.) Back up each answer with a short reason.

Problem 2:
Experimentally determine the size of first-level cache on two Unix platforms (one platform being the course Alphastations and one that does not have an Alpha CPU). A starting point for your experiment can be found at cachsize.c Be sure to compile all your programs with the C compiler optimization turned on!

This program sequentially reads all the elements from a data array. It does this in two places -- first to pre-load the array into cache to eliminate any compulsory misses. Second it reads the data in a timed loop that repeatedly accesses the array to determine what the sustained access speed to the array will be (presumably faster if the array is small enough to be completely resident in cache). The number of times the array is read is scaled according to the array size to keep the total number of memory accesses relatively constant no matter what the array size. Thus, the relevant results are in MBytes/second read, not the actual execution time.

You will need to modify the program so that the timed loop takes approximately 10 seconds of execution time so that the experimental noise from the timer quantization is low. (Anything between 10 and 100 seconds is fine.) Your best bet is to use or modify the loop limit tuning mechanism used in the example; querying the timer each time through the loop will disturb data cache contents.

Compile your modified program and obtain times for array sizes in 1 KB increments from 1 KB to 40 KB. (On some machines you may need to go higher; if so you can use a larger spread between data points; use your judgement but take data up to approximately 2.5 times the L1 cache size). Repeat the below steps for both machines; you may plot two curves on the same graph as long as they are color-coded or otherwise obviously distinguishable.

1. Plot the results on a graph with bytes/sec on a linear Y axis and array size on a linear X axis. Include machine information in the graph title (CPU model if you can tell, otherwise machine make/model and the machine hostname). Also annotate the graph with the C compiler name and flags you used.
2. Draw an arrow to the graph feature that indicates the Level-1 cache size on your machine. What is the L1 cache size?
3. There should be 3 different "regions" of performance in your graph. Label the regions of: nearly 100% L1 cache hits; some hits+some misses; maximal number of misses (it won't be 100% misses because of spatial locality for block size > integer size).
4. What was the widest possible span that you would expect for the "some hits+some misses" region? In round numbers, what fraction of this width did your experimental data take up (e.g., all of it, one third of it). Will discuss in a later class why it might not take up the entire maximum possible width.

If you don't have access to a second workstation type, log in to: unix.andrew.cmu.edu with the I.D. of butler and obtain a list of idle Andrew workstations.