18-548/15-548 Fall 1998

Lab 3: Cache Simulation &
Varying Cache Parameters

Due Friday, October 9, 1998 at 3:00 PM

This lab takes you through a simulation-based exploration of the effects of cache parameters on performance for a benchmark program. In this case the benchmark is cjpeg, which performs image compression and outputs files in jpeg format. cjpeg is a multimedia application, although it has fairly good overall cache behavior compared to some. (If you're curious about jpeg, there is a FAQ entry available on it.) Note that there are two reasons to perform such simulations: system design and performance improvement of the application.

The homework-specific files you will need are in /afs/ece/class/ece548/www/hw/lab3 . You must establish a symbolic link to that directory from your working directory for the command lines in this homework to work properly:
ln -s /afs/ece/class/ece548/www/hw/lab3
You are welcome to use and adapt any shell or perl scripts you find in that directory to help you with the homework.

You are strongly urged to get an early start on this lab. Running simulations is a crucial part of being a computer architect, and thinking about what the simulations are doing (rather than simply punching through a set of parameters) will help you understand the concepts we are covering in class. You may want to set up a file using the "at" facility ("man at" will give more info) and run your simulations overnight, testing them first to make sure they're likely to "go".

Problem 1. Compare with H&P Traces

In this problem we perform a cursory comparison of cjpeg to three other programs in the H&P trace collection. You'll get to see whether this image compression program "looks" any different from so-called general purpose computing.

The H&P traces are fixed in size and content, but you will be able to generate traces on-the-fly for cjpeg because it has been "atom-ized". In other words, an instrumentation package called Atom has been used to augment the cjpeg program with printf statements that spit out trace information suitable for use by dinero on every instruction fetch, data load, and data store. You'll be using atom yourself in a coming homework. But for now we've done that for you.

1. In order to use cjpeg you'll need an image file, which for this homework must be in .gif format. You have an image assigned to you in the directory lab3/gif (using a rather obvious naming and image content scheme). Please don't modify the file in any way before using it. Save this image to your working directory and run cjpeg on it (where "fname" is the file name you got from the archive):
lab3/cjpeg -outfile fname.jpg fname.gif
This will translate the .gif file into a .jpg file. Load the .jpg file into Netscape or otherwise view it in some program and print it out (B&W printing is fine). Clearly annotate it with the filename and the file sizes of both the .gif and .jpg version (a .jpg file is typically less than half the size, but this varies significantly depending on the picture).

2. Let's get some idea of what cjpeg is doing to perform this compression. cjpeg.atom is the instrumented version of the program that produces a full dinero trace file. Run this instrumented version through dinero and report the instruction miss ratio, data read miss ratio, data write miss ratio, and overall traffic ratio using the following command line (for this and all other problems, use the graphics image selected in part 1 of this problem; obviously the below command should be entered as a single line).
lab3/cjpeg.atom -outfile fname.jpg fname.gif | dinero -i8K -d8K -b16 -W8 -B8 -z1000000000
Cut & paste the dinero result from this command line; circle and label the requested 4 pieces of information. If you have even the slightest doubt whether you are looking at the correct numbers ask for help from the course staff. (Note: don't try to pipe the trace output from cjpeg.atom to a file in your working directory -- it is likely to be up to a GB in size).

3. Now let's compare this to other traces. Run the same dinero parameters on the following three dinero input files in the homework directory: cc1.din, spice.din, tex.din. Create two tables indicating for each of the three programs and cjpeg:
Table 1: number of instructions, number of reads, number of writes, total demands (sum of all demand references)
Table 2: overall miss ratio, instruction miss ratio, data read miss ratio, data write miss ratio, traffic ratio.
You'll want to use command lines for the other traces that look slightly different, since these are 32-bit programs with 4-byte accesses, and are in general pretty short traces. In particular, use:
dinero -i8K -d8K -b16 < lab3/cc1.din
dinero -i8K -d8K -b16 < lab3/spice.din
dinero -i8K -d8K -b16 < lab3/tex.din
Compared to the three H&P traces, how well behaved is cjpeg (what is the rank of its cache performance among the four cases in your tables)?

Problem 2. Block Size

The classical way to run a cache experiment is to pick a starting point as we did above and then do sensitivity analysis to see which parameters matter and which don't. In this problem we'll look at block size.

In order to keep run times tractable, there is a second instrumented version of the jpeg program called cjpeg.atomd, which differs in that it does not include instruction accesses in the output trace. So, use that for the following problems just as you used cjpeg.atom above. In actuality the results could still be "real", because you can think of it as an experiment to determine what memory access patterns you'd see if you were building an ASIC to do this operation -- only data accesses.

It is recommended that if there is room on the local scratch volume you run cjpeg.atomd once and save the results in a file on /scratch to feed to dinero. BUT, please be sure to erase this temporary file when you're done to make room for others. The file should be on the order of 100 MB in size; the scratch volumes hold 500 MB to 1 GB depending on the machine. NOTE: scratch volume files are not backed up, and are automatically deleted after 24 hours or so.

1. Run dinero while varying the block size from 8 bytes to 4096 bytes in increments of a factor of two (i.e., 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096). Tabulate the results as block size, data miss ratio, and traffic ratio (note that there will be no instructions in the trace you use for this). Other than block size, use the dinero command line below, which makes sure you use the whole trace. Obviously you'll have to pipe your data file or the cjpeg.atomd output into dinero for this to work.
dinero -i8K -d8K -b16 -W8 -B8 -z1000000000

2. Plot the block size results with block size on a logarithmic X axis and both ratios on the same Y axis (use the Y axis that works best to show the data). Which two or three block sizes look attractive and why? Which type of miss (which of the three "C"s) dominates at the right hand side of your graph (block size 4096)?
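
If you'd rather not type ten command lines by hand for part 1, the sweep can be sketched as a short script. This is only one possible approach, written in Python for illustration (the scripts in the homework directory are shell and perl). The trace path is a hypothetical placeholder; substitute the file you saved on /scratch. The script only prints the command lines and changes nothing but the -b value of the base command line.

```python
# Sketch: print one dinero command line per block size for the Problem 2 sweep.
# TRACE is a hypothetical path, not a file provided with the lab.
TRACE = "/scratch/yourname/cjpeg.din"

# 8, 16, ..., 4096: powers of two from 2**3 through 2**12.
BLOCK_SIZES = [2 ** k for k in range(3, 13)]


def sweep_commands(trace=TRACE):
    """Return the base command line with only the -b value changed."""
    return [
        f"dinero -i8K -d8K -b{size} -W8 -B8 -z1000000000 < {trace}"
        for size in BLOCK_SIZES
    ]


if __name__ == "__main__":
    for command in sweep_commands():
        print(command)
```

The printed lines can be pasted into a file for use with "at", after testing one or two of them by hand first.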

Problem 3. Associativity

Now let's look at associativity, still using the data-only trace from cjpeg.atomd as in Problem 2.

1. Run dinero while varying the associativity, using values of {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18}. Note that you will have to round the cache size up or down a block to hold an integral number of blocks in the cache. The desired approach is not to make the cache size one convenient to physically realize, but rather to do an approximate apples-to-apples comparison of associativity with essentially identical cache sizes (i.e., cache size should not vary from 8192 by more than a few bytes, and in many cases will not be a power of two; as an example, for a 3-way set associative cache you would use -d8208 because that holds an integral number of 16-byte blocks and is divisible by three). Use the same command line as in Problem 2 (dinero -i8K -d8K -b16 -W8 -B8 -z1000000000), but vary associativity. Tabulate the resultant cache size, associativity, overall data miss ratio, and traffic ratio.

2. Plot the associativity results with associativity on a linear X axis and both ratios on the same graph using a Y axis that makes sense. If it were physically realizable (i.e., ignore the fact that you don't have a number of sets that is an even power of 2, and therefore addressing the memory array would be painful), would you prefer the simulated system with 3-way set associativity or the one with 4-way set associativity? Why?

3. Run another simulation with 4-way set associative cache just under 8KB instead of exactly 8KB in size (i.e., -d8128 -i8128). This cache is a little smaller, and yet it has a better miss rate. What insight does this give you about the shape of the curve you just plotted in part 2 of this problem? (i.e., make a brief statement about program behavior with respect to how it is accessing the cache.)
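
The cache-size arithmetic in parts 1 and 3 can be sketched as follows. This is an illustration under a stated assumption, not a required method: it rounds to the multiple of (associativity x block size) nearest to 8192, which is only one reasonable reading of "round up or down". The function names are made up for this sketch, and no dinero flags are assumed beyond the -d size shown in the handout.

```python
# Sketch: pick a data cache size near 8 KB that holds a whole number of sets.
BLOCK = 16      # bytes per block, from -b16 in the handout
TARGET = 8192   # nominal 8 KB cache size
WAYS = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18]


def nearest_size(ways, block=BLOCK, target=TARGET):
    """Multiple of ways*block closest to target (assumed rounding rule)."""
    set_bytes = ways * block
    # Integer round-half-up of target / set_bytes, with at least one set.
    sets = max(1, (2 * target + set_bytes) // (2 * set_bytes))
    return sets * set_bytes


def is_valid(size, ways, block=BLOCK):
    """True if size splits evenly into sets of `ways` blocks each."""
    return size % (ways * block) == 0


if __name__ == "__main__":
    for ways in WAYS:
        size = nearest_size(ways)
        print(f"{ways:2d}-way: -d{size} ({size // (ways * BLOCK)} sets)")
```

Under this rounding rule the 3-way entry comes out as 8208, matching the example in part 1, and is_valid(8128, 4) holds for the part 3 cache (127 sets of 64 bytes).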

Problem 4. Write Policies

Now let's look at write policies, and see if they matter.

1. Again run data-only simulations that test all four combinations of write-through/write-back and write-allocation/write-no-allocation given the same base command line from Problem 2 (dinero -i8K -d8K -b16 -W8 -B8 -z1000000000). Show the overall data miss ratio and traffic ratio data in a table. Assuming equal implementation cost, which combination is best?
