

| Assignments                                                                    |
|--------------------------------------------------------------------------------|
| • By next class read:                                                          |
| • Cragon 3.0-3.3.1; 3.4 to top of page 166                                     |
| - You are <i>not</i> accountable for the details of the examples               |
| Supplemental reading:                                                          |
| • Hennessy & Patterson: 5.7, 5.8                                               |
| <ul> <li>http://www.cne.gmu.edu/modules/vm/submap.html</li> </ul>              |
| • Virtual Memory: Issues of Implementation ( <i>Computer</i> , June 1998)      |
| <ul> <li>Homework #2 due next Wednesday</li> </ul>                             |
| Physical Memory                                                                |
| Cache organization & access                                                    |
| Lab #1 due Friday 3:00 PM to course secretary                                  |
| • Short answers are in general better than long ones (single sentence is fine) |
| • BUT, be sure to put something down for each question asked.                  |

# Where Are We Now? Where we've been: Physical memory hierarchy CPU registers Cache Bus Main Memory Mass storage Where we're going today: Cache organization -- a key piece of the physical memory hierarchy How it works Performance metrics Where we're going next: Virtual memory architecture -- address mapping within memory hierarchy

## **Preview**

### Cache organization

- Terminology
- Data Organization
- Control status bit organization

### Conceptual cache operation

• Hits, misses, access sequences

### • Quantifying cache performance

- 3 "C"s -- different causes of cache misses
- Effective memory access time



| Cache Memory                                                          |
|-----------------------------------------------------------------------|
| • A small, high speed memory close to the CPU                         |
| • Each line of cache memory can be "occupied" or "empty"              |
| "Occupied" lines map to a memory location                             |
| Hardware cache management mechanism (potentially with software hints) |
| Typically used for both programs and data                             |
| • Also, by extension, other specialized hardware or software buffers  |
| TLB for mapping physical to virtual memory addresses                  |
| Disk sector storage                                                   |
| Network machine name to IP address translation                        |
| Read and write buffers                                                |
| Works because of locality                                             |
| • (Doesn't work when there isn't locality)                            |
| First commercial use:                                                 |







# **Major Cache Design Decisions**

• Cache Size -- in bytes

 Split/Unified -- instructions and data in same cache? Associativity -- how many sectors in each set?

Associativity -- now many sectors in each set:

- Sector/Block size -- how many bytes grouped together as a unit?
  - Cache "Block" also called Cache "Line"
- How many levels of cache?
  - Perhaps L1 cache on-chip; L2 cache on-module; L3 cache on motherboard

### Management policies

+

- Choosing victim for replacement on cache miss
  - FIFO, Least Recently Used, random
- Handling writes (when is write accomplished?; is written data cached?)
  - Write allocate: allocates a cache block if a write results in a cache miss
  - Write through: all written data goes "through" cache and onto bus
  - Write back: written data remains in cache until cache block is replaced, then to bus
- · Ability to set or over-ride policies in software









# **Per-Block Control bits**

- Control bits are per block (as opposed to tags per sector) to reduce memory accesses
- Valid: does entry contain valid data (as opposed to being "empty")
  - Could be in middle of filling a cache sector
  - Could have been invalidated by a program or by coherence protocol
- **Dirty:** is data modified from what is resident in memory?
  - "write-back" cache holds modifications in cache until line is flushed to memory
- Shared: is data shared within a multiprocessor (may be more than one bit depending on coherence scheme)?
- LRU: bit to track Least Recently Used replacement status

# **CACHE OPERATION**







| Example Cache Organization: [8,2,2,2]                               |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
|---------------------------------------------------------------------|-------------------|--------|--------|--------------|------|--------|---|--------------|------|------------|-----------------|---|------|------|--------|--------|------|------|
| <ul> <li>◆ 2-way Set Associative</li> </ul>                         |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| ◆ [8 sets, 2 sectors, 2 blocks, 2 words] = <u>64 words</u> in cache |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| • 2 words / block                                                   |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| • 2 blocks / sector                                                 |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
|                                                                     |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
|                                                                     | • 2 sectors / set |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| •                                                                   | 8 sets            | ./ (   | ent    | ire cao      | che  |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| SECTOR 0 SECTOR 1                                                   |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
| SECTOR 0 SECTOR 1                                                   |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      | _      |        |      |      |
|                                                                     | BLOCK 0 BLOCK 1   |        |        |              |      |        |   |              |      |            | BLOCK 0 BLOCK 1 |   |      |      |        |        |      | 1    |
|                                                                     |                   |        |        |              |      |        |   |              |      |            | ~               |   | ~    |      | 6      |        | ~    |      |
| SET 0                                                               | TAG               | V      | D      |              |      |        | _ |              | WORD | TAG        | ۷               | D |      |      | ۷      | D      | WORD |      |
| SET 1                                                               | TAG               | V      | D      | WORD         |      | V      | D | WORD         |      | TAG        | V               | D | WORD |      | V      | D      | WORD |      |
| SET 2<br>SET 3                                                      | TAG<br>TAG        | V<br>V | D<br>D | WORD<br>WORD |      | V<br>V | D | WORD<br>WORD |      | TAG<br>TAG | V<br>V          | D | WORD |      | V<br>V | D<br>D | WORD |      |
| SET 3                                                               | TAG               | V      | D      | WORD         |      | v      | _ | WORD         |      | TAG        | v<br>v          |   | WORD |      | v<br>V | D      | WORD |      |
| SET 5                                                               | TAG               | v      | D      | WORD         |      | v      | D | WORD         |      | TAG        | v               | D | WORD |      | v      | D      | WORD |      |
| SET 6                                                               | TAG               | ۷      | D      | WORD         | WORD | ۷      | D | WORD         |      | TAG        | ۷               | D | WORD | WORD | ۷      | D      | WORD | WORD |
| SET 7                                                               | TAG               | ۷      | D      | WORD         | WORD | ۷      | D | WORD         | WORD | TAG        | ۷               | D | WORD | WORD | ۷      | D      | WORD | WORD |
|                                                                     |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |
|                                                                     |                   |        |        |              |      |        |   |              |      |            |                 |   |      |      |        |        |      |      |











# Simplified Cache Functions

### • Determine if cache hit or miss

### • If miss:

- Use policy information to determine whether a cache block must be allocated
- If allocation required, select a cache location to allocate (a "victim")
  - If selected location has modified data, copy data to lower hierarchy level
  - Set tag to map cache address to new memory level address
  - Fetch any required data from memory to fill cache word(s)

### • If read: return datum value

• If write: update datum value









9/2/98



# **Baseline Actions on Cache Access**

### Read:

- Determine if data is in cache (hit or miss)
- If miss, determine victim block for replacement
- If victim sector has any "dirty" blocks (unsaved data), flush to memory
- · Annotate tag on victim sector to refer to new address
- Fill victim block with new data (modify LRU bit; set valid bit; prefetch?)

### Write:

- Determine if data is in cache (hit or miss)
- If hit, write data to appropriate cache block
  - If write through policy, also write data to main memory
- If miss, either:
  - If no-write-allocate policy, write data to main memory
  - If write allocate policy
    - $\, \ast \,$  Perform a cache read if block size > size of written datum
    - » Write datum into cache
    - » If write through policy, also write data to main memory

# **Cache Misses**

 $P_{miss} = P_{compulsory} + P_{capacity} + P_{conflict}$ 

• Compulsory:

- Previously unreferenced blocks
- Compulsory miss ratio is that of an infinitely large cache
- Capacity:
  - Sectors that have been discarded to make room for new sectors
  - · A fully associative cache experiences only compulsory and capacity misses
- Conflict misses:
  - Set associative caches must discard a sector within a set to allocate a block on a miss, even if other sets have unused sectors
- Warm cache: process has been running "a long time"
- Cold cache: cache has been flushed
  - Some other task might have "stomped on" the cache
  - Software I/O synchronization might flush/invalidate cache







# **Cache Effectiveness Metrics**

### Access time: (LATENCY)

- t<sub>hit</sub> often 1 clock cycle for L1 cache, 2 or more clocks for L2 cache
- t<sub>miss</sub> can be tens of clock cycles in many cases
- t<sub>ea</sub> = average access time (depends on miss ratio)

### Bus traffic ratio: (BANDWIDTH) ratio of bus traffic with cache to traffic without cache

- Can be > 1 with write-back caches
- Traffic ratio can be critically important on multiprocessors, where cache is used as a filter to reduce shared bus traffic











# **Review**

- Locality: temporal, spatial, sequential
- Cache organization
  - Sets, sectors, blocks
    - [S sets, SE sectors, B blocks, W words] = S \* SE \* B \* W words in cache
  - Tags, dirty, shared, LRU
- Cache operation
  - Search for match, decide if miss
  - Select & update contents
- Cache misses: compulsory, capacity, conflict

• 
$$t_{ea} = (1 - P_{miss}) * t_{hit} + P_{miss} * t_{mis}$$

$$T = a + b \overset{\text{ac}}{\underset{e}{\nabla}} \frac{A}{-} 1 \overset{\text{o}}{\underset{e}{\div}}$$



# **Key Concepts**

### Latency

- Late select minimizes latency by accessing data & tag in parallel
- · Average cache access latency depends on compulsory, capacity, conflict misses
- Transport time depends on miss ratios & access latencies

### • Bandwidth

- Block size determines bandwidth opportunities (can move an entire block at a time, data paths permitting)
- Bus traffic ratio depends on miss rate and block size

### Concurrency

• Multiple blocks per sector permits sharing cost of tag in cache

### Balance

- Sector size (to save space) *vs.* number of sets in cache (to gain flexibility) is a key balance tradeoff that will be discussed in another lecture
- Transport time (miss rate) vs. bus traffic ratio is another key balance issue

# RECITATION MATERIAL

**Example Read Transport Time** • Embedded 32-bit CPU; cache block size of 4 x 32-bit words; s-bit memory bus at half CPU speed; 1 byte/bus cycle + 2 bus cycles overhead to start transfers • All clocks are in terms of CPU clocks • a = 4+2=6 clocks (4 clocks to reach memory + 2 clocks first byte xfer) • b = 2 clocks (CPU clock cycles twice for each bus access) • A = 4 (4 @ 32-bit words in a block) • W = 0.25 (.25 words (1 byte of 4) transferred per bus cycle)  $T = a + b \stackrel{\text{ee}}{\leftarrow} \frac{A}{-0.25} - 1 \stackrel{\text{O}}{\leftarrow} \frac{}{=} 6+ 2(15) = 36$  CPU clocks