# Homework 2

Due Wednesday September 9, 1998

Problem 1:

You are a DARPA program manager and someone submits a proposal for a multi-chip module for radar signal processing. To provide sufficient computational power, you need to process a 2048 x 2048 (= 4 Mega-data-element) array of 64-bit data elements once every 100 msec.

a) Draw a "plumbing diagram" for this system and label bandwidths for each piece of "pipe" under the following assumptions:

1. There is one multi-chip module, with 8 identical CPUs bonded to it.
2. Each CPU processes exactly one-eighth of the data array every 100 msec period, with no data shared among CPUs. (So, each CPU touches all words in its one-eighth of the array, and only those words.) Assume instruction accesses have a 0% cache miss ratio.
3. Each CPU has a cache with a 10% miss rate, and accesses each 8-byte word of data within its one-eighth of the array 20 times per 100 msec interval.
4. There is a 64-bit data bus going from the multi-chip module to main memory, which can sustain a transfer rate of one piece of 64-bit data every clock cycle, operating at 50 MHz.
5. There are two banks of ideal memory (no inter-bank conflicts), each of which can complete a 64-bit word transfer every 20 ns.

b) What is the bandwidth bottleneck in this system, and how "big" should that pipe be to just barely eliminate the bottleneck?

c) What is the maximum acceptable cache miss rate to eliminate the problem of the bandwidth bottleneck observed in part (b) of this question?
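
As a starting point for labeling the pipes in part (a), the unit conversions can be sketched as below. This is only a minimal sketch of the capacities stated in assumptions 4 and 5 and of the raw array size; the helper name `pipe_bandwidth` and the variable names are hypothetical, and the traffic generated by cache hits and misses is deliberately left out.

```python
# Sketch of the unit conversions for Problem 1 (names are illustrative only).
ARRAY_SIDE = 2048          # array is 2048 x 2048 elements
WORD_BITS = 64             # each element is a 64-bit word
PERIOD_MS = 100            # the whole array is processed every 100 msec


def pipe_bandwidth(width_bits, transfers_per_second):
    """Return the capacity of a pipe in bytes per second."""
    return (width_bits // 8) * transfers_per_second


# Size of the array and the rate at which it must be consumed.
array_bytes = ARRAY_SIDE * ARRAY_SIDE * (WORD_BITS // 8)
periods_per_second = 1000 // PERIOD_MS
array_rate = array_bytes * periods_per_second

# Assumption 4: 64-bit bus, one transfer per clock at 50 MHz.
bus_bandwidth = pipe_bandwidth(64, 50_000_000)

# Assumption 5: each bank completes one 64-bit transfer every 20 ns.
bank_bandwidth = pipe_bandwidth(64, 1_000_000_000 // 20)

if __name__ == "__main__":
    print("array size (bytes):      ", array_bytes)
    print("array rate (bytes/s):    ", array_rate)
    print("bus capacity (bytes/s):  ", bus_bandwidth)
    print("per-bank capacity (B/s): ", bank_bandwidth)
```

The same helper can be reused for any other pipe in the diagram once its width and transfer rate are known.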

Problem 2:

This is an exploration of memory bandwidth versus latency. Assume that you have a processor which takes 3 clock cycles to access cache.

• Calculate and plot curves as follows. Show a table with all calculated values and an example calculation (for example, if you use Excel, include the formula for a representative cell in the spreadsheet):
• X axis is number of total clock cycles to access main memory. Plot points for every 3 clock cycles on a linear scale up to 48 clocks. (i.e., 3, 6, 9, ... , 45, 48). Note that these numbers include the time accessing cache, so a time of 3 means that all of memory is cache memory.
• Y axis is program execution time in clock cycles. Assume that the program has a total of 1 million accesses made to memory (most to cache, but some miss in cache and end up referencing main memory).
• Plot five curves assuming for each curve a different percentage of accesses miss in cache and go to main memory: 1%, 5%, 15%, 25%, 35%. For example, the 1% curve would assume that 99% of the 1 million accesses take 3 clock cycles (for cache), and 10,000 accesses (1% of 1 million) take the number of clock cycles for each point plotted on the X axis.
• Example: for the 1% curve and 15 clock cycles, the total number of clock cycles would be:
990000 * 3 + 10000 * 15 = 3120000
You are encouraged to compute these numbers and plot them with a program such as MATLAB or a spreadsheet such as Excel (an example spreadsheet for this problem is provided for your convenience).
• At about how many main memory access clocks (what X value) does the program run half as fast as it does for a memory access time of 3 clocks (X value of 3 -- which is equivalent to all memory being as fast as cache)? Show how you compute all five answers. Check your work by looking at the graph for those points that appear on it.
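
If you prefer a script to a spreadsheet, the table of values can be generated along the following lines. This is a minimal sketch that follows the example calculation above (hits cost 3 clocks, misses cost the X-axis value); the function names `execution_clocks` and `build_table` are hypothetical, and plotting is left to the tool of your choice.

```python
# Sketch of the Problem 2 table (names are illustrative only).
CACHE_CLOCKS = 3             # clocks for an access that hits in cache
TOTAL_ACCESSES = 1_000_000   # total memory accesses made by the program


def execution_clocks(miss_percent, memory_clocks,
                     total_accesses=TOTAL_ACCESSES,
                     cache_clocks=CACHE_CLOCKS):
    """Total clocks when miss_percent of accesses go to main memory."""
    misses = total_accesses * miss_percent // 100
    hits = total_accesses - misses
    return hits * cache_clocks + misses * memory_clocks


def build_table(miss_percents=(1, 5, 15, 25, 35),
                memory_clocks=range(3, 49, 3)):
    """Map each miss percentage to its list of Y values, one per X value."""
    return {p: [execution_clocks(p, x) for x in memory_clocks]
            for p in miss_percents}


if __name__ == "__main__":
    xs = list(range(3, 49, 3))
    table = build_table()
    # Header row, then one row per X value (tab separated).
    print("X", *[f"{p}%" for p in table], sep="\t")
    for i, x in enumerate(xs):
        print(x, *[table[p][i] for p in table], sep="\t")
```

Calling `execution_clocks(1, 15)` reproduces the example value of 3120000 given above.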

Problem 3:

Let's say that you have a choice between spending money on cache bandwidth or bus bandwidth. You must choose between the following two design options:

Design Option 1: Off-chip cache memory access takes 4 clocks (by providing 256 pins for data); main memory access takes 24 clocks (by using a 32-bit data bus and cycling 4 times).

Design Option 2: Off-chip cache memory access takes 6 clocks (by providing only 128 pins for data and cycling twice for each transfer); main memory access takes 16 clocks (by using a 128-bit data bus).

• Assume a 4% cache miss rate (i.e., 96% of accesses are to the off-chip cache memory, and 4% are to main memory). If you have to choose more pins for the cache or more bits on the data bus, which of the above two options will be faster and by how much in terms of clocks per average access?
• Which case would be faster with a 20% cache miss rate and by how much?
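
The comparison above comes down to a weighted average of the two access times. A minimal sketch is given below, assuming (as in Problem 2) that the main memory figure is the total number of clocks for an access that misses in cache; the names `average_access_clocks`, `OPTIONS`, and `compare` are hypothetical.

```python
# Sketch of the average-access-time comparison (names are illustrative only).
# Assumption: the main memory clock count is the full cost of a miss.


def average_access_clocks(miss_rate, cache_clocks, memory_clocks):
    """Weighted average clocks per access for a miss rate in [0, 1]."""
    return (1.0 - miss_rate) * cache_clocks + miss_rate * memory_clocks


# (cache clocks, main memory clocks) for each design option in the problem.
OPTIONS = {
    "Design Option 1": (4, 24),
    "Design Option 2": (6, 16),
}


def compare(miss_rate):
    """Return the average clocks per access for every design option."""
    return {name: average_access_clocks(miss_rate, cache, memory)
            for name, (cache, memory) in OPTIONS.items()}


if __name__ == "__main__":
    for rate in (0.04, 0.20):
        for name, clocks in compare(rate).items():
            print(f"miss rate {rate:.0%}  {name}: {clocks:.2f} clocks/access")
```

Sweeping `miss_rate` over a range of values with `compare` shows where the two options trade places.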