Reminders

- **Lab 4: Due March 21 (today!)**
  - Please try to do the extra credit as well!

- **Homework 5: Due March 26**

- The course will move quickly... Keep your pace. Talk with the TAs and me if you are concerned about your performance.
Readings for Today and Next Lecture

- **Memory Hierarchy and Caches**

- Cache chapters from P&H: 5.1-5.3

- Memory/cache chapters from Hamacher+: 8.1-8.7

- An early cache paper by Maurice Wilkes
Cache Replacement Policy

- LRU vs. Random
  - **Set thrashing:** When the “program working set” in a set is larger than set associativity
  - 4-way: Cyclic references to A, B, C, D, E
    - 0% hit rate with LRU policy
  - Random replacement policy is better when thrashing occurs

- In practice:
  - Depends on workload
  - Average hit rate of LRU and Random are similar

- Hybrid of LRU and Random
  - How to choose between the two? Set sampling
Optimal Replacement Policy?

- Belady’s OPT
  - Replace the block that is going to be referenced furthest in the future by the program
  - How do we implement this? Simulate?

- Is this optimal for minimizing miss rate?
- Is this optimal for minimizing execution time?
  - No. Cache miss latency/cost varies from block to block!
  - Two reasons: Remote vs. local caches and miss overlapping
Aside: Cache versus Page Replacement

- Physical memory (DRAM) is a cache for disk
  - Usually managed by system software via the virtual memory subsystem

- Page replacement is similar to cache replacement
- Page table is the “tag store” for physical memory data store

- What is the difference?
  - Hardware versus software
  - Number of blocks in a cache versus physical memory
  - “Tolerable” amount of time to find a replacement candidate
What’s In A Tag Store Entry?

- Valid bit
- Tag
- Replacement policy bits

- Dirty bit?
  - Write back vs. write through caches
Handling Writes (Stores)

- When do we write the modified data in a cache to the next level?
  - **Write through**: At the time the write happens
  - **Write back**: When the block is evicted

- **Write-back**
  - Can consolidate multiple writes to the same block before eviction
    - Potentially saves bandwidth between cache levels + saves energy
  - Need a bit in the tag store indicating the block is “modified”

- **Write-through**
  - Simpler
  - All levels are up to date. **Consistency**: Simpler cache coherence because no need to check lower-level caches
  - More bandwidth intensive; no coalescing of writes
Handling Writes (Stores)

- Do we allocate a cache block on a write miss?
  - Allocate on write miss: Yes
  - No-allocate on write miss: No

- Allocate on write miss
  + Can consolidate writes instead of writing each of them individually to next level
  + Simpler because write misses can be treated the same way as read misses
  -- Requires (?) transfer of the whole cache block

- No-allocate
  + Conserves cache space if locality of writes is low (potentially better cache hit rate)
Sectored Caches

- Divide a block into subblocks (or sectors)
  - Have separate valid and dirty bits for each sector
  - When is this useful? (Think writes…)
  - How many subblocks do you transfer on a read?

++ No need to transfer the entire cache block into the cache
   (A write simply validates and updates a subblock)
++ More freedom in transferring subblocks into the cache (a
cache block does not need to be in the cache fully)

-- More complex design
-- May not exploit spatial locality fully when used for reads

| v | d | subblock | v | d | subblock | •••• | v | d | subblock | tag |
Instruction vs. Data Caches

- Unified:
  + Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches)
  -- Instructions and data can thrash each other (i.e., no guaranteed space for either)
  -- I and D are accessed in different places in the pipeline. Where do we place the unified cache for fast access?

- First level caches are almost always split
  - Mainly for the last reason above

- Second and higher levels are almost always unified
Multi-level Caching in a Pipelined Design

- First-level caches (instruction and data)
  - Decisions very much affected by cycle time
  - Small, lower associativity
  - Tag store and data store accessed in parallel

- Second-level caches
  - Decisions need to balance hit rate and access latency
  - Usually large and highly associative; latency not as important
  - Tag store and data store accessed serially

- Serial vs. Parallel access of levels
  - Serial: Second level cache accessed only if first-level misses
  - Second level does not see the same accesses as the first
    - First level acts as a filter
Virtual Memory and Cache Interaction
Address Translation and Caching

- When do we do the address translation?
  - Before or after accessing the L1 cache?

- In other words, is the cache virtually addressed or physically addressed?
  - Virtual versus physical cache

- What are the issues with a virtually addressed cache?

- Synonym problem:
  - Two different virtual addresses can map to the same physical address → same physical address can be present in multiple locations in the cache → can lead to inconsistency in data
Homonyms and Synonyms

- **Homonym**: Same VA can map to two different PAs
  - Why?
    - VA is in different processes

- **Synonym**: Different VAs can map to the same PA
  - Why?
    - Different pages can share the same physical frame within or across processes
    - Reasons: shared libraries, shared data, copy-on-write pages within the same process, ...

- Do homonyms and synonyms create problems when we have a cache?
  - Is the cache virtually or physically addressed?
Cache-VM Interaction

CPU → TLB → cache → lower hier.

CPU → cache → lower hier.

CPU → cache → tlb → lower hier.

physical cache  virtual (L1) cache  virtual-physical cache
Physical Cache

PIPT cache (Physical cache)

VA

TLB

physical

index

physical

PA

A physical address can be only one location in cache

=?

hit?
Virtual Cache
Virtual-Physical Cache

VIPT code

TLB

PFN

Where can the same physical address be in the code?
Virtually-Indexed Physically-Tagged

- If \( \text{C} \leq (\text{page\_size} \times \text{associativity}) \), the cache index bits come only from page offset (same in VA and PA).
- If both cache and TLB are on chip
  - index both arrays concurrently using VA bits
  - check cache tag (physical) against TLB output at the end

![Diagram showing the relationship between VPN, Page Offset, TLB, PPN, cache, and data.]
Virtually-Indexed Physically-Tagged

- If $C > (\text{page\_size} \times \text{associativity})$, the cache index bits include VPN
  - Synonyms can cause problems
    - The same physical address can exist in two locations

- Solutions?

Diagram:

- TLB
- VPN
- Page Offset
- Index
- BiB
- physical cache
- PPN
- tag
- data
- TLB hit?
- cache hit?
Some Solutions to the Synonym Problem

- Limit cache size to (page size times associativity)
  - get index from page offset

- On a write to a block, search all possible indices that can contain the same physical block, and update/invalidate
  - Used in Alpha 21264, MIPS R10K

- Restrict page placement in OS
  - make sure index(VA) = index(PA)
  - Called page coloring
  - Used in many SPARC processors
An Exercise

- Problem 5 from
  - ECE 741 midterm exam Problem 5, Spring 2009
An Exercise (I)

We have a byte-addressable toy computer that has a physical address space of 512 bytes. The computer uses a simple, one-level virtual memory system. The page table is always in physical memory. The page size is specified as 8 bytes and the virtual address space is 2 KB.

**Part A.**

i.  (1 point)
How many bits of each virtual address is the virtual page number?

ii. (1 point)
How many bits of each physical address is the physical frame number?
We would like to add a 128-byte write-through cache to enhance the performance of this computer. However, we would like the cache access and address translation to be performed simultaneously. In other words, we would like to index our cache using a virtual address, but do the tag comparison using the physical addresses (virtually-indexed physically-tagged). The cache we would like to add is direct-mapped, and has a block size of 2 bytes. The replacement policy is LRU. Answer the following questions:

iii. (1 point)
How many bits of a virtual address are used to determine which byte in a block is accessed?

iv. (2 point)
How many bits of a virtual address are used to index into the cache? Which bits exactly?

v. (1 point)
How many bits of the virtual page number are used to index into the cache?

vi. (5 points)
What is the size of the tag store in bits? Show your work.
Suppose we have two processes sharing our toy computer. These processes share some portion of the physical memory. Some of the virtual page-physical frame mappings of each process are given below:

<table>
<thead>
<tr>
<th>PROCESS 0</th>
<th>PROCESS 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Virtual Page</td>
<td>Physical Frame</td>
</tr>
<tr>
<td>Page 0</td>
<td>Frame 0</td>
</tr>
<tr>
<td>Page 3</td>
<td>Frame 7</td>
</tr>
<tr>
<td>Page 7</td>
<td>Frame 1</td>
</tr>
<tr>
<td>Page 15</td>
<td>Frame 3</td>
</tr>
</tbody>
</table>

vii. (2 points)  
Give a complete physical address whose data can exist in two different locations in the cache.

viii. (3 points)  
Give the indexes of those two different locations in the cache.
ix.  (5 points)
We do not want the same physical address stored in two different locations in the 128-byte cache. We can prevent this by increasing the associativity of our virtually-indexed physically-tagged cache. What is the minimum associativity required?

x.  (4 points)
Assume we would like to use a direct-mapped cache. Describe a solution that ensures that the same physical address is never stored in two different locations in the 128-byte cache.
Solutions to the Exercise


- And, more exercises are in past exams and in your homeworks...
Review: Solutions to the Synonym Problem

- Limit cache size to \((\text{page size times associativity})\)
  - get index from page offset

- On a write to a block, search all possible indices that can contain the same physical block, and update/invalidate
  - Used in Alpha 21264, MIPS R10K

- Restrict page placement in OS
  - make sure \(\text{index(VA)} = \text{index(PA)}\)
  - Called page coloring
  - Used in many SPARC processors
Some Questions to Ponder

- At what cache level should we worry about the synonym and homonym problems?

- What levels of the memory hierarchy does the system software’s page mapping algorithms influence?

- What are the potential benefits and downsides of page coloring?
Virtual Memory – DRAM Interaction

- Operating System influences where an address maps to in DRAM

- Operating system can control which bank/channel/rank a virtual page is mapped to.

- It can perform page coloring to minimize bank conflicts
- Or to minimize inter-application interference
Cache Performance
Cache Parameters vs. Miss Rate

- Cache size
- Block size
- Associativity
- Replacement policy
- Insertion/Placement policy
Cache Size

- **Cache size**: total data (not including tag) capacity
  - bigger can exploit temporal locality better
  - not ALWAYS better

- Too large a cache adversely affects hit and miss latency
  - smaller is faster => bigger is slower
  - access time may degrade critical path

- Too small a cache
  - doesn’t exploit temporal locality well
  - useful data replaced often

- **Working set**: the whole set of data the executing application references
  - Within a time interval
Block Size

- Block size is the data that is associated with an address tag
  - not necessarily the unit of transfer between hierarchies
    - **Sub-blocking**: A block divided into multiple pieces (each with V bit)
      - Can improve “write” performance

- Too small blocks
  - don’t exploit spatial locality well
  - have larger tag overhead

- Too large blocks
  - too few total # of blocks
    - likely-useless data transferred
    - Extra bandwidth/energy consumed
Large Blocks: Critical-Word and Subblocking

- Large cache blocks can take a long time to fill into the cache
  - fill cache line critical word first
  - restart cache access before complete fill

- Large cache blocks can waste bus bandwidth
  - divide a block into subblocks
  - associate separate valid bits for each subblock
  - When is this useful?

```
v    d    subblock   v    d    subblock   •••••• v    d    subblock   tag
```
Associativity

- How many blocks can map to the same index (or set)?

- Larger associativity
  - lower miss rate, less variation among programs
  - diminishing returns, higher hit latency

- Smaller associativity
  - lower cost
  - lower hit latency
    - Especially important for L1 caches

- Power of 2 associativity?
Classification of Cache Misses

- **Compulsory miss**
  - first reference to an address (block) always results in a miss
  - subsequent references should hit unless the cache block is displaced for the reasons below
  - dominates when locality is poor

- **Capacity miss**
  - cache is too small to hold everything needed
  - defined as the misses that would occur even in a fully-associative cache (with optimal replacement) of the same capacity

- **Conflict miss**
  - defined as any miss that is neither a compulsory nor a capacity miss
How to Reduce Each Miss Type

- Compulsory
  - Caching cannot help
  - Prefetching

- Conflict
  - More associativity
  - Other ways to get more associativity without making the cache associative
    - Victim cache
    - Hashing
    - Software hints?

- Capacity
  - Utilize cache space better: keep blocks that will be referenced
  - Software management: divide working set such that each “phase” fits in cache
Improving Cache “Performance”

- Remember
  - Average memory access time (AMAT)
    \[ \text{AMAT} = ( \text{hit-rate} \times \text{hit-latency} ) + ( \text{miss-rate} \times \text{miss-latency} ) \]

- Reducing miss rate
  - Caveat: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted

- Reducing miss latency/cost

- Reducing hit latency
Improving Basic Cache Performance

- Reducing miss rate
  - More associativity
  - Alternatives/enhancements to associativity
    - Victim caches, hashing, pseudo-associativity, skewed associativity
  - Better replacement/insertion policies
  - Software approaches

- Reducing miss latency/cost
  - Multi-level caches
  - Critical word first
  - Subblocking/sectoring
  - Better replacement/insertion policies
  - Non-blocking caches (multiple cache misses in parallel)
  - Multiple accesses per cycle
  - Software approaches
Victim Cache: Reducing Conflict Misses


- Idea: Use a small fully associative buffer (victim cache) to store evicted blocks
  - Can avoid ping ponging of cache blocks mapped to the same set (if two cache blocks continuously accessed in nearby time conflict with each other)
  - Increases miss latency if accessed serially with L2
Hashing and Pseudo-Associativity

- Hashing: Better “randomizing” index functions
  + can reduce conflict misses
    - by distributing the accessed memory blocks more evenly to sets
  - Example: stride where stride value equals cache size
    -- More complex to implement: can lengthen critical path

- Pseudo-associativity (Poor Man’s associative cache)
  - Serial lookup: On a miss, use a different index function and access cache again
  - Given a direct-mapped array with K cache blocks
    - Implement K/N sets
    - Given address Addr, sequentially look up: \{0, Addr[\log(K/N)-1: 0]\}, \{1, Addr[\log(K/N)-1: 0]\}, ..., \{N-1, Addr[\log(K/N)-1: 0]\}
Skewed Associative Caches (I)

- Basic 2-way associative cache structure

```
Way 0
---
<table>
<thead>
<tr>
<th>Tag</th>
<th>Index</th>
<th>Byte in Block</th>
</tr>
</thead>
</table>

Same index function for each way

Way 1
---
<table>
<thead>
<tr>
<th>Tag</th>
<th>Index</th>
<th>Byte in Block</th>
</tr>
</thead>
</table>

=?
```

44
Skewed Associative Caches (II)

- Skewed associative caches
  - Each bank has a different index function

![Diagram](image-url)
Skewed Associative Caches (III)

- Idea: Reduce conflict misses by using different index functions for each cache way

- Benefit: indices are randomized
  - Less likely two blocks have same index
    - Reduced conflict misses
  - May be able to reduce associativity

- Cost: additional latency of hash function

Improving Hit Rate via Software (I)

- Restructuring data layout
- Example: If column-major
  - $x[i+1,j]$ follows $x[i,j]$ in memory
  - $x[i,j+1]$ is far away from $x[i,j]$

<table>
<thead>
<tr>
<th>Poor code</th>
<th>Better code</th>
</tr>
</thead>
<tbody>
<tr>
<td>for $i = 1$, rows</td>
<td>for $j = 1$, columns</td>
</tr>
<tr>
<td>for $j = 1$, columns</td>
<td>for $i = 1$, rows</td>
</tr>
<tr>
<td>sum = sum + $x[i,j]$</td>
<td>sum = sum + $x[i,j]$</td>
</tr>
</tbody>
</table>

- This is called **loop interchange**
- Other optimizations can also increase hit rate  
  - Loop fusion, array merging, ...
- What if multiple arrays? Unknown array size at compile time?
More on Data Structure Layout

- Pointer based traversal (e.g., of a linked list)
- Assume a huge linked list (1M nodes) and unique keys
- Why does the code on the left have poor cache hit rate?
  - “Other fields” occupy most of the cache line even though rarely accessed!

```c
struct Node {
    struct Node* node;
    int key;
    char [256] name;
    char [256] school;
}

while (node) {
    if (node->key == input-key) {
        // access other fields of node
    }
    node = node->next;
}
```
How Do We Make This Cache-Friendly?

- **Idea:** separate frequently-used fields of a data structure and pack them into a separate data structure

- **Who should do this?**
  - Programmer
  - Compiler
    - Profiling vs. dynamic
  - Hardware?
  - Who can determine what is frequently used?

```c
struct Node {
    struct Node* node;
    int key;
    struct Node-data* node-data;
}

struct Node-data {
    char [256] name;
    char [256] school;
}

while (node) {
    if (node->key == input-key) {
        // access node->node-data
    }
    node = node->next;
}
```
Blocking

- Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
- Avoids cache conflicts between different chunks of computation
- Essentially: Divide the working set so that each piece fits in the cache

But, there are still self-conflicts in a block
1. there can be conflicts among different arrays
2. array sizes may be unknown at compile/programming time
Improving Basic Cache Performance

- Reducing miss rate
  - More associativity
  - Alternatives/enhancements to associativity
    - Victim caches, hashing, pseudo-associativity, skewed associativity
  - Better replacement/insertion policies
  - Software approaches

- Reducing miss latency/cost
  - Multi-level caches
  - Critical word first
  - Subblocking/sectoring
  - Better replacement/insertion policies
  - Non-blocking caches (multiple cache misses in parallel)
  - Multiple accesses per cycle
  - Software approaches
Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew’ 98].

Several techniques to improve MLP (e.g., out-of-order execution).

MLP varies. Some misses are isolated and some parallel.

How does this affect cache replacement?
Traditional Cache Replacement Policies

- Traditional cache replacement policies try to reduce miss count.

- **Implicit assumption**: Reducing miss count reduces memory-related stall time.

- Misses with varying cost/MLP **breaks** this assumption!

- Eliminating an isolated miss helps performance more than eliminating a parallel miss.

- Eliminating a higher-latency miss could help performance more than eliminating a lower-latency miss.
An Example

Misses to blocks P1, P2, P3, P4 can be parallel
Misses to blocks S1, S2, and S3 are isolated

Two replacement algorithms:
  1. Minimizes miss count (Belady’s OPT)
  2. Reduces isolated miss (MLP-Aware)

For a fully associative cache containing 4 blocks
Fewest Misses ≠ Best Performance

Belady’s OPT replacement

MLP-Aware replacement
MLP-Aware Cache Replacement

- How do we incorporate MLP into replacement decisions?

  - Required reading for this week
Enabling Multiple Outstanding Misses
Handling Multiple Outstanding Accesses

- **Non-blocking or lockup-free caches**

- **Question:** If the processor can generate multiple cache accesses, can the later accesses be handled while a previous miss is outstanding?

- **Idea:** Keep track of the status/data of misses that are being handled in Miss Status Handling Registers (MSHRs)
  - A cache access checks MSHRs to see if a miss to the same block is already pending.
    - If pending, a new request is not generated
    - If pending and the needed data available, data forwarded to later load
  - Requires buffering of outstanding miss requests
Non-Blocking Caches (and MLP)

- Enable cache access when there is a pending miss
- Enable multiple misses in parallel
  - Memory-level parallelism (MLP)
    - generating and servicing multiple memory accesses in parallel
  - Why generate multiple misses?
    - Enables latency tolerance: overlaps latency of different misses
- How to generate multiple misses?
  - Out-of-order execution, multithreading, runahead, prefetching
Miss Status Handling Register

- Also called “miss buffer”
- Keeps track of
  - Outstanding cache misses
  - Pending load/store accesses that refer to the missing cache block
- Fields of a single MSHR entry
  - Valid bit
  - Cache block address (to match incoming accesses)
  - Control/status bits (prefetch, issued to memory, which subblocks have arrived, etc)
  - Data for each subblock
  - For each pending load/store
    - Valid, type, data size, byte in block, destination register or store buffer entry address
### Miss Status Handling Register Entry

<table>
<thead>
<tr>
<th>Valid</th>
<th>Block Address</th>
<th>Issued</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>27</td>
<td>1</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Load/store 0</th>
<th>Load/store 1</th>
<th>Load/store 2</th>
<th>Load/store 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Valid</td>
<td>Type</td>
<td>Block Offset</td>
<td>Destination</td>
</tr>
<tr>
<td>Valid</td>
<td>Type</td>
<td>Block Offset</td>
<td>Destination</td>
</tr>
<tr>
<td>Valid</td>
<td>Type</td>
<td>Block Offset</td>
<td>Destination</td>
</tr>
<tr>
<td>Valid</td>
<td>Type</td>
<td>Block Offset</td>
<td>Destination</td>
</tr>
</tbody>
</table>
MSHR Operation

■ On a cache miss:
  □ Search MSHRs for a pending access to the same block
    ■ Found: Allocate a load/store entry in the same MSHR entry
    ■ Not found: Allocate a new MSHR
    ■ No free entry: stall

■ When a subblock returns from the next level in memory
  □ Check which loads/stores waiting for it
    ■ Forward data to the load/store unit
    ■ Deallocate load/store entry in the MSHR entry
  □ Write subblock in cache or MSHR
  □ If last subblock, deallocate MSHR (after writing the block in cache)
Non-Blocking Cache Implementation

- When to access the MSHRs?
  - In parallel with the cache?
  - After cache access is complete?

- MSHRs need not be on the critical path of hit requests
  - Which one below is the common case?
    - Cache miss, MSHR hit
    - Cache hit
Enabling High Bandwidth Caches
(and Memories in General)
Multiple Instructions per Cycle

- Can generate multiple cache accesses per cycle
- How do we ensure the cache can handle multiple accesses in the same clock cycle?

Solutions:
- true multi-porting
- virtual multi-porting (time sharing a port)
- multiple cache copies
- banking (interleaving)
Handling Multiple Accesses per Cycle (I)

- **True multiporting**
  - Each memory cell has multiple read or write ports
    + Truly concurrent accesses (no conflicts regardless of address)
    -- Expensive in terms of latency, power, area
  - What about read and write to the same location at the same time?
    - Peripheral logic needs to handle this
Peripheral Logic for True Multiporting

Diagram showing a peripheral logic setup for true multiporting, including CPU or I/O, semaphore cells, and dual/port RAM memory cells.
Peripheral Logic for True Multiporting
Handling Multiple Accesses per Cycle (I)

- **Virtual multiporting**
  - Time-share a single port
  - Each access needs to be (significantly) shorter than clock cycle
  - Used in Alpha 21264
  - Is this scalable?
Handling Multiple Accesses per Cycle (II)

- **Multiple cache copies**
  - Stores update both caches
  - Loads proceed in parallel

- **Used in Alpha 21164**

- **Scalability?**
  - Store operations form a bottleneck
  - Area proportional to “ports”
Handling Multiple Accesses per Cycle (III)

- **Banking (Interleaving)**
  - Bits in address determines which bank an address maps to
    - Address space partitioned into separate banks
    - Which bits to use for “bank address”?
  - + No increase in data store area
  - -- Cannot satisfy multiple accesses to the same bank
  - -- Crossbar interconnect in input/output

- **Bank conflicts**
  - Two accesses are to the same bank
  - How can these be reduced?
    - Hardware? Software?
General Principle: Interleaving

- **Interleaving (banking)**
  - **Problem**: A single monolithic memory array takes long to access and does not enable multiple accesses in parallel
  - **Goal**: Reduce the latency of memory array access and enable multiple accesses in parallel
  - **Idea**: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
    - Each bank is smaller than the entire memory storage
    - Accesses to different banks can be overlapped
  - **Issue**: How do you map data to different banks? (i.e., how do you interleave data across banks?)