# Homework 6: Cache Memory Size/Speed Tradeoff

Due Wednesday October 14, 1998

Multilevel Caches

Problem 1:

You have a computer with two levels of cache memory and the following specifications:

• CPU clock: 330 MHz
• Bus: 64-bit data transfers at 66 MHz
• Processor: 32-bit RISC scalar CPU, at most one data address per instruction
• L1 cache: on-chip, 1 CPU cycle access
  • block size = 32 bytes
  • 1 block/sector
  • split I & D caches, each single-ported with one block available for access; non-blocking
• L2 cache: off-chip, 3 CPU cycles transport time (L1 miss penalty)
  • block size = 32 bytes
  • 1 block/sector
  • unified, single-ported, blocking, non-pipelined
• Main memory: 15+5+5+5 CPU cycles transport time per 32 bytes (L2 miss penalty)
Below are the results of a Dinero simulation of the L1 cache:

```
CMDLINE: dinero -b32 -i8K -d8K -a1 -ww -An -W8 -B8
CACHE (bytes): blocksize=32, sub-blocksize=0, wordsize=8, Usize=0, Dsize=8192, Isize=8192, bus-width=8.
POLICIES: assoc=1-way, replacement=l, fetch=d(1,0), write=w, allocate=n.
CTRL: debug=0, output=0, skipcount=0, maxcount=10000000, Q=0.

Metrics               Access Type:
(totals,fraction)         Total   Instrn     Data     Read    Write     Misc
-----------------      -------- -------- -------- -------- -------- --------
Demand Fetches         10000000  7362210  2637790  1870945   766845        0
                         1.0000   0.7362   0.2638   0.1871   0.0767   0.0000
Demand Misses             52206     8466    43740    36764     6976        0
                         0.0052   0.0011   0.0166   0.0196   0.0091   0.0000

Words From Memory      180920
( / Demand Fetches)    0.0181
Words Copied-Back      766845
( / Demand Writes)     1.0000
Total Traffic (words)  947765
( / Demand Fetches)    0.0948
```
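As a cross-check on how the derived lines of the Dinero report follow from the raw counts, the Python sketch below recomputes them. It assumes the usual reading of the report fields: blocksize=32 with wordsize=8 gives 4 words per block fill, write=w means write-through, and allocate=n means write-no-allocate. Variable names are illustrative only.

```python
# Raw counts copied from the Dinero report above.
instr_fetches, data_reads, data_writes = 7362210, 1870945, 766845
instr_misses, read_misses, write_misses = 8466, 36764, 6976

BLOCK_BYTES = 32                              # blocksize=32
WORD_BYTES = 8                                # wordsize=8
words_per_block = BLOCK_BYTES // WORD_BYTES   # 4 words per block fill

total_fetches = instr_fetches + data_reads + data_writes
total_misses = instr_misses + read_misses + write_misses

# allocate=n (write-no-allocate): a write miss does not fetch a block,
# so only instruction misses and read misses cause block fills.
words_from_memory = (instr_misses + read_misses) * words_per_block

# write=w (write-through): every store sends its one word onward.
words_copied_back = data_writes
total_traffic = words_from_memory + words_copied_back

print(f"demand miss ratio: {total_misses / total_fetches:.4f}")
print(f"words from memory: {words_from_memory}")
print(f"total traffic:     {total_traffic} words "
      f"({total_traffic / total_fetches:.4f} per demand fetch)")
```

The recomputed values match the report (180920 words from memory, 947765 words of total traffic), which supports reading the write traffic as one word per store.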

1) What is the available (as opposed to used) sustained bandwidth:

• L1 cache bandwidth available to CPU (assuming 0% L1 misses)?
• L2 cache bandwidth available to L1 cache (assuming 0% L2 misses)?
• Main memory bandwidth available to L2 cache?
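A minimal sketch of the bandwidth arithmetic, assuming a level sustains one transfer every "transport time" CPU cycles with no overlap between transfers (plausible for a blocking, non-pipelined design, but still an assumption). The helper name `sustained_bandwidth` is hypothetical; the worked value is main memory, using the 15+5+5+5 timing from the specification.

```python
CPU_CLOCK_HZ = 330e6   # 330 MHz CPU clock (from the specification)
BUS_CLOCK_HZ = 66e6    # 66 MHz, 64-bit bus (from the specification)


def sustained_bandwidth(bytes_per_transfer: float,
                        cpu_cycles_per_transfer: float,
                        clock_hz: float = CPU_CLOCK_HZ) -> float:
    """Bytes per second, assuming back-to-back, non-overlapped transfers."""
    seconds_per_transfer = cpu_cycles_per_transfer / clock_hz
    return bytes_per_transfer / seconds_per_transfer


# One bus cycle lasts 330 / 66 = 5 CPU cycles, presumably the source of the
# "+5" terms: 32 bytes over a 64-bit bus is 4 transfers of 8 bytes.
cpu_cycles_per_bus_cycle = round(CPU_CLOCK_HZ / BUS_CLOCK_HZ)
main_memory_cycles = 15 + 5 + 5 + 5   # CPU cycles per 32-byte block

mem_bw = sustained_bandwidth(32, main_memory_cycles)
print(f"main memory -> L2: {mem_bw / 1e6:.0f} MB/s")
```

The same helper applies to the other two levels once you decide how many bytes each level delivers per access and how often it can accept a new access.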

2) How long does an average instruction take to execute (in ns), assuming 1 clock cycle per instruction in the absence of memory hierarchy stalls, no write buffering at the L1 cache level, and 0% L2 miss rate?
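One plausible way to set up this calculation, sketched in Python. It assumes one instruction fetch per instruction, that each L1 instruction miss and read miss stalls for the 3-cycle L2 transport time, and that, with write-through and no write buffer, every store (hit or miss) also waits 3 cycles. The intended model may count stores differently; the counts come from the Dinero output above.

```python
CPU_CLOCK_HZ = 330e6
CYCLE_NS = 1e9 / CPU_CLOCK_HZ   # about 3.03 ns per CPU cycle
L2_TRANSPORT_CYCLES = 3         # L1 miss penalty, with 0% L2 misses

# Counts from the Dinero run. ASSUMPTION: one instruction fetch per instruction.
instructions = 7362210
instr_misses = 8466
read_misses = 36764
data_writes = 766845            # all stores, hits and misses

# ASSUMPTION: instruction misses, read misses, and (write-through, unbuffered)
# stores each stall the CPU for the full L2 transport time.
stall_events = instr_misses + read_misses + data_writes
cpi = 1 + L2_TRANSPORT_CYCLES * stall_events / instructions
avg_ns = cpi * CYCLE_NS
print(f"CPI = {cpi:.4f}, average time per instruction = {avg_ns:.2f} ns")
```

Under these assumptions the unbuffered stores dominate the stall cycles, which is the usual motivation for a write buffer in front of a write-through L1.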

3) A design study is performed to examine replacing the L2 cache with a victim cache. Compute a measure of speed for each alternative and indicate which is the faster solution. Assume the performance statistics are:

• L2 cache local miss ratio = 0.18
• Victim cache miss ratio = 0.23
• Victim cache transport time from L1 miss = 1 CPU clock

System Level Effects

Problem 2:

1) A Ph.D. student has snuck onto the course machines to run a long simulation. That task is suspended while a '548 student runs a cache-wiping homework problem, causing all of the simulation's data to be expelled from cache. What is the approximate time penalty, in clocks, associated with refilling the caches when the simulation resumes execution? Restated: assuming that the simulation runs to completion after it is restarted, how much longer (in clocks charged to that particular task) will it take to run than if it had not been interrupted?

• The program makes only single-word memory accesses.
• Ignore time to save and restore registers, as well as time spent in the operating system on scheduling tasks; the point is to compute the extra time spent waiting for cache misses as the simulation is loaded back into cache via demand misses.
• Be sure to account for the fact that, when refilling each cache, some of the accesses would have been misses anyway and thus do not extend running time.
• The reason we say "approximate" is that you should compute L1, L2 & L3 penalties independently and then add them, rather than try to figure out coupling between them. Thus this is a "back-of-the-envelope" calculation rather than a precise result.

| | L1 Cache | L2 Cache | L3 Cache |
|---|---|---|---|
| Organization | split | unified | unified |
| Size | 8 KB data + 8 KB instr. | 96 KB | 8 MB |
| Associativity | direct mapped | 3-way set associative | direct mapped |
| Blocks per sector | 2 | 2 | 2 |
| Words per block | 4 | 4 | 4 |
| Write policy | write through | write back | write back |
| Write allocation | no | yes | yes |
| Hit time | 1 clock | 4 clocks | 12 clocks |
| Total miss time | 4 clocks | 12 clocks | 90 clocks |
| Local miss ratio | 0.13 (same for D & I) | 0.04 | 0.02 |
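The back-of-the-envelope method described above can be sketched as follows: for each cache, refilling costs roughly (number of blocks) × (1 − local miss ratio) extra misses, times the clocks charged per miss, and the per-level results are added. Several inputs are assumptions, labeled in the code: the word size (not stated in the problem; 4 bytes is a hypothetical choice), whether to charge the full "Total miss time" or only the cycles beyond a hit, and that the interrupted task had filled every block of each cache with one demand miss per block. The function name `refill_penalty` is hypothetical.

```python
# HYPOTHETICAL: the word size is not given in the problem; 4 bytes is assumed.
WORD_BYTES = 4
BLOCK_BYTES = 4 * WORD_BYTES   # "Words per block" = 4 at every level


def refill_penalty(cache_bytes: int, block_bytes: int,
                   local_miss_ratio: float, clocks_per_miss: float) -> float:
    """Approximate extra clocks to reload one cache level after a flush.

    Assumes every block must be re-fetched by one demand miss, and that a
    fraction local_miss_ratio of those accesses would have missed anyway.
    """
    blocks = cache_bytes // block_bytes
    extra_misses = blocks * (1 - local_miss_ratio)
    return extra_misses * clocks_per_miss


# (name, size in bytes, local miss ratio, clocks charged per extra miss)
# ASSUMPTION: clocks per miss uses the table's "Total miss time"; subtract
# the hit time instead if you count only the cycles beyond a hit.
levels = [
    ("L1 data", 8 * 1024, 0.13, 4),
    ("L1 instr", 8 * 1024, 0.13, 4),
    ("L2", 96 * 1024, 0.04, 12),
    ("L3", 8 * 1024 * 1024, 0.02, 90),
]

total = 0.0
for name, size, miss_ratio, clocks in levels:
    penalty = refill_penalty(size, BLOCK_BYTES, miss_ratio, clocks)
    total += penalty
    print(f"{name:8s}: {penalty:12.0f} clocks")
print(f"total   : {total:12.0f} clocks")
```

Because each level is computed independently, this deliberately ignores coupling between levels (for example, an L1 refill that hits in an already-refilled L2), which is what makes it approximate.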