

# Assignments

- By next class read about cache organization and access:
  - Hennessy & Patterson 5.1, 5.2, pp. 390-393
  - Cragon 2.1, 2.1.2, 2.1.3, 2.2.2
- Supplemental Reading:
  - Flynn 1.6, 5.1
- Homework 1 due Wednesday September 2
- Lab 1 due Friday September 4 at 3 PM

# Where Are We Now?

- Where we've been:
  - Key concepts applied to memory hierarchies
    - Latency
    - Bandwidth
    - Concurrency
    - Balance
- Where we're going today:
  - Physical memory architecture -- a trip through the memory hierarchy
- Where we're going next:
  - Cache organization and access
  - Virtual memory

## **Preview**

- Principles of Locality
- Physical Memory Hierarchy:
  - CPU registers
  - Cache
  - Bus
  - Main Memory
  - Mass storage
- Bandwidth "Plumbing" diagrams











# **Spatial Locality**

 Once a word has been accessed, neighboring words are likely to be accessed

### Program structures

- Short relative branches
- Related methods in same object (e.g., constructor followed by operator)

### Data structures

- Records (C language struct)
- Operations on 1-dimensional arrays of data
- Image processing (e.g., 2-D convolution)
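The effect of array traversal order on spatial locality can be sketched with a small address model. This is a minimal illustration, not a cache simulator: the 64 x 64 array of 4-byte elements, the row-major layout, the 64-byte block size, and the helper names `address` and `block_transitions` are all assumptions chosen for the example.

```python
# Sketch: count how often consecutive accesses fall in a different
# 64-byte block, for two traversal orders of a row-major 2-D array.
# Array shape, element size, and block size are illustrative assumptions.

ROWS, COLS = 64, 64
ELEM_BYTES = 4
BLOCK_BYTES = 64


def address(row, col):
    """Byte address of element (row, col) in a row-major array at address 0."""
    return (row * COLS + col) * ELEM_BYTES


def block_transitions(addresses, block_bytes):
    """Number of consecutive access pairs that land in different blocks."""
    blocks = [addr // block_bytes for addr in addresses]
    return sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev != cur)


# Row-wise traversal walks neighboring words; column-wise strides by a full row.
row_order = [address(r, c) for r in range(ROWS) for c in range(COLS)]
col_order = [address(r, c) for c in range(COLS) for r in range(ROWS)]

row_changes = block_transitions(row_order, BLOCK_BYTES)
col_changes = block_transitions(col_order, BLOCK_BYTES)
print("row-wise block changes:", row_changes)
print("column-wise block changes:", col_changes)
```

Under these assumptions the row-wise walk changes blocks only once per 16 elements, while the column-wise walk changes blocks on every access, which is why operations on 1-dimensional (or row-ordered) data show strong spatial locality.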

### Coincidental spatial locality

- Putting important routines together for virtual memory locality
- Overlays (manual management of pages with small address spaces)

Branch offset coverage (Flynn Table 3.12)

| Offset | % Branches |
|--------|------------|
| ± 8    | 13%        |
| ± 16   | 33%        |
| ± 32   | 37%        |
| ± 64   | 46%        |
| ± 128  | 55%        |
| ± 256  | 61%        |
| ± 512  | 68%        |
| ± 1K   | 73%        |
| ± 2K   | 79%        |
| ± 4K   | 83%        |
| ± 8K   | 87%        |
| ± 16K  | 93%        |
| ± 32K  | 99%        |
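One way to read the table above is to ask how large a branch offset must be to cover a given share of branches. The sketch below copies the percentages from the table; the helper names and the interpretation of "± N" as the range of a two's-complement displacement field are assumptions made for illustration, and the table does not state the offset units.

```python
# Sketch: query the branch-offset coverage figures from the table above.
# Assumption: "± N" is treated as a two's-complement field spanning -N .. N-1.

COVERAGE = [
    (8, 0.13), (16, 0.33), (32, 0.37), (64, 0.46), (128, 0.55),
    (256, 0.61), (512, 0.68), (1024, 0.73), (2048, 0.79),
    (4096, 0.83), (8192, 0.87), (16384, 0.93), (32768, 0.99),
]


def smallest_offset_covering(target_fraction):
    """Smallest tabulated offset whose coverage meets the target, else None."""
    for offset, fraction in COVERAGE:
        if fraction >= target_fraction:
            return offset
    return None


def signed_field_bits(offset):
    """Bits in a two's-complement field spanning -offset .. offset-1.

    Only meaningful for power-of-two offsets, as in the table.
    """
    return offset.bit_length()


half = smallest_offset_covering(0.50)
most = smallest_offset_covering(0.90)
print(half, signed_field_bits(half))
print(most, signed_field_bits(most))
```

Under these assumptions, about half of all branches fit in an 8-bit displacement, while covering 90% takes a 15-bit displacement, so short relative branches capture much of the benefit at low encoding cost.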









## **Registers Are Software-Managed Cache**

- Registers offer highest bandwidth and shortest latency in system
  - Built into the CPU
  - Multi-ported fast access register file
- Compiler/programmer manages registers
  - Explicit load and store instructions
  - No hardware interlocks for dependencies
- Very large register set can be effective in the right circumstances
  - Vector computing where registers store intermediate vector results
  - Embedded systems with multiple register sets for context switching speed
    - But may not be so great for general purpose computing



# **Cache Memory**

 A small memory that provides the CPU with low latency and high bandwidth access

- Typically hardware management of which memory locations reside in cache at any point in time
- Can be entirely software managed, but this is not commonly done today (may become more common on multiprocessor systems for managing coherence)

### Multiple levels of cache memory possible

• Each successive level is typically bigger but slower; faster levels are SRAM, slower levels may be DRAM

### Bandwidth -- determined by interconnection technology

- On-chip limited by number of bits in memory cell array
- Off-chip limited by number of pins and speed at which they can be clocked

### Latency -- determined by interconnection technology & memory speed

- On-chip can be one or more clocks, depending on connect/cycle delays
- Off-chip is likely to be 3-10 clocks, depending on system design







# **Block Size**

### Cache is organized into blocks of data that are moved as a whole

- Blocks can range from 4 bytes to 256 bytes or more in size
- Blocks are loaded and stored as a single unit to/from the cache

### Number of blocks = cache size / block size

- Example: 8 KB cache with 64-byte blocks: number of blocks = 8192 / 64 = 128 blocks
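The block-count arithmetic above can be checked with a few lines of code. This is a minimal sketch; the function name `num_blocks` is hypothetical, and the requirement that the cache size be a whole multiple of the block size is an assumption of the sketch.

```python
# Sketch: number of blocks = cache size / block size.

def num_blocks(cache_bytes, block_bytes):
    """Blocks in a cache; assumes the size is a whole multiple of the block."""
    if block_bytes <= 0 or cache_bytes % block_bytes != 0:
        raise ValueError("cache size must be a positive multiple of block size")
    return cache_bytes // block_bytes


# The example from the text: 8 KB cache with 64-byte blocks.
print(num_blocks(8 * 1024, 64))   # 128

# The same 8 KB cache at the two ends of the block-size range given above.
print(num_blocks(8 * 1024, 4))    # 2048
print(num_blocks(8 * 1024, 256))  # 32
```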





## **Bus**

Bus is a shared system interconnection

- CPU to cache
  - Might be on system bus
  - As CPUs get more pins they move to a dedicated cache bus or all within package
- Cache to main memory
- CPU or main memory to I/O
- Typically high-speed connection, and often carries processor/memory traffic
- Typically accessed transparently via a memory address

# **Bus Performance**

### Bandwidth -- limited by cost and transmission line effects

- 64-bit or 128-bit data bus common (but, fewer bits on cost-sensitive systems)
  - Why was the 8088 used instead of the 8086 in the original IBM PC?
- Bus speed often limited to 50-66 MHz due to transmission line effects
- Example: Pentium Pro -- up to 528 MB/sec for 64-bit bus at 66 MHz
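The Pentium Pro figure follows from multiplying bus width by bus clock rate. A minimal sketch of that calculation is below; the function name is hypothetical, and it assumes one transfer per bus clock and MB meaning 10^6 bytes.

```python
# Sketch: peak bus bandwidth = bytes per transfer * transfers per second.
# Assumes one transfer per bus clock and 1 MB = 10**6 bytes.

def peak_bus_bandwidth_mb_per_sec(width_bits, clock_mhz):
    """Peak bandwidth in MB/sec for a bus of the given width and clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz


# 64-bit bus at 66 MHz, as in the Pentium Pro example.
print(peak_bus_bandwidth_mb_per_sec(64, 66))   # 528.0

# Doubling the width doubles the peak figure at the same clock.
print(peak_bus_bandwidth_mb_per_sec(128, 66))  # 1056.0
```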

### Latency -- limited by distance and need for drivers

- Multiple clock latency, but can pipeline and achieve 1 clock/datum throughput
- Be careful about "bus clocks" vs. "processor clocks"
  - Many current processors clocked at a multiple of the bus frequency
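The two latency points above can be combined in a rough calculation. The sketch below assumes a simple pipelined model in which the first datum arrives after the stated latency and each later datum arrives one bus clock after the previous one; the function names and the example values (3-clock latency, 4 data, 3x clock multiplier) are illustrative assumptions.

```python
# Sketch: pipelined bus transfer time, in bus clocks and processor clocks.
# Model assumption: first datum at `latency_clocks`, then one datum per clock.

def transfer_bus_clocks(latency_clocks, num_data):
    """Bus clocks until the last of num_data pipelined data arrives."""
    if num_data < 1:
        raise ValueError("need at least one datum")
    return latency_clocks + (num_data - 1)


def to_processor_clocks(bus_clocks, clock_multiplier):
    """Convert bus clocks to processor clocks for a given clock multiplier."""
    return bus_clocks * clock_multiplier


bus_clocks = transfer_bus_clocks(latency_clocks=3, num_data=4)
cpu_clocks = to_processor_clocks(bus_clocks, clock_multiplier=3)
print(bus_clocks, cpu_clocks)  # 6 18
```

A latency quoted as "6 clocks" therefore costs 18 processor clocks on a CPU running at three times the bus frequency, which is why the distinction between the two clocks matters.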



# **Interconnect Performance**

### Bandwidth -- usually limited by cost to fast serial connection

- Crossbar provides very high bandwidth (n simultaneous connections), but costs $O(n^2)$ in terms of switching nodes
- Omega network provides potentially high bandwidth, but suffers from blocking/congestion
- 10 Mbit/sec common for Ethernet; 100 Mbit/sec being introduced
  - Also limited by cost of switch to avoid sharing of high-speed line

### Latency -- limited by routing

- Crossbar offers lowest latency, but at high cost
- Each "hop" on network requires passing through an embedded processor/switch
  - Can be many seconds for the Internet
- Omega network provides high potential bandwidth, but at cost of latency of $\log n$ switching stages
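The cost and latency trade-off between a crossbar and an Omega network can be made concrete by counting switching elements. The sketch below assumes the usual Omega construction from 2x2 switches, with n a power of two, giving log2(n) stages of n/2 switches each; the function names are hypothetical.

```python
# Sketch: switching cost of a crossbar vs. an Omega network for n ports.
# Assumes the Omega network is built from 2x2 switches and n is a power of 2.

def crossbar_crosspoints(n):
    """A full crossbar needs one crosspoint per (input, output) pair."""
    return n * n


def omega_stages(n):
    """Number of switching stages: log2(n) for n a power of two."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def omega_switches(n):
    """Total 2x2 switches: n/2 per stage times log2(n) stages."""
    return (n // 2) * omega_stages(n)


for ports in (8, 64, 1024):
    print(ports, crossbar_crosspoints(ports),
          omega_stages(ports), omega_switches(ports))
```

For 64 ports this gives 4096 crosspoints for the crossbar against 192 switches in 6 stages for the Omega network, so the Omega network is far cheaper but every connection pays the latency of all 6 stages.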





# Main Memory

### Main Memory is what programmers (think they) manipulate

- Program space
- Data space
- Commonly referred to as "physical memory" (as opposed to "virtual memory")

### Typically constructed from DRAM chips

- Multiple clock cycles to access data, but may operate in a "burst" mode once data access is started
- Optimized for capacity, not necessarily speed

### Latency -- determined by DRAM construction

- Shared pins for high & low half of address to save on packaging costs
- Typically 2 or 3 bus cycles to begin accessing data
- Once access is initiated, can return multiple data at a rate of one datum per bus clock
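The startup latency and burst rate above together determine how much of the peak bandwidth a DRAM access actually delivers. The sketch below is a back-of-envelope model under stated assumptions: `startup_cycles` bus cycles pass before the first datum, then one datum arrives per bus cycle; the 8-byte bus at 66 MHz and the function name are illustrative choices, not figures from a specific memory system.

```python
# Sketch: effective bandwidth of a DRAM burst access.
# Model assumption: startup_cycles before data begins, then 1 datum per cycle.
# Uses 1 MB = 10**6 bytes.

def burst_bandwidth_mb_per_sec(bus_bytes, bus_mhz, startup_cycles, burst_len):
    """Average MB/sec over one access that returns burst_len data."""
    total_cycles = startup_cycles + burst_len
    return bus_bytes * burst_len * bus_mhz / total_cycles


# Illustrative 8-byte (64-bit) bus at 66 MHz with a 3-cycle startup.
single = burst_bandwidth_mb_per_sec(8, 66, startup_cycles=3, burst_len=1)
burst4 = burst_bandwidth_mb_per_sec(8, 66, startup_cycles=3, burst_len=4)
print(round(single, 1), round(burst4, 1))
```

Longer bursts amortize the startup cycles over more data, so the effective figure moves toward the one-datum-per-clock peak without ever reaching it.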

# **Main Memory Capacities**

### Main memory capacity is determined by DRAM chip

- At least 1 "bank" of DRAM chips is required for minimum memory size
- Multiple banks (or bigger chips) used to increase memory capacity

### Memory words typically same width as bus

- Peak memory bandwidth is usually one word per bus cycle
- Sustained memory bandwidth varies with the complexity of the memory design
- Representative main memory bandwidth is 500 MB/sec peak; 125 MB/sec sustained

# **Special-Purpose Memory**

### Memory dedicated to special hardware use

- Graphics frame buffer
  - Special construction to support high-speed transfers to video screen
  - Dual-ported access for modify-while-display
- Video capture buffer
- Perhaps could consider memory embedded in other devices as special-purpose system memory
  - Postscript printer memory
  - Audio sample tables





# "Backing Store"

### Used for data that doesn't fit into main memory

- "Virtual memory" uses it to emulate a larger physical memory
- File systems use it for nonvolatile storage of information

### Hard disk technology is prevalent, but don't forget:

- Flash memory (especially for mobile computing) based on IC technology
- Floppy disks (up to 100 MB)
- CD-ROM (and WORM, and magneto-optical, and ...)
- Novel storage (bar codes, optical paper tape...)

### Magnetic disks have killed off:

- Magnetic drums
- Punched cards
- Paper tape
- Bubble memory
- Magnetic tape (not quite dead yet)







| Level         | Capacity    | Latency              | Bandwidth      |
|---------------|-------------|----------------------|----------------|
| Registers     | 128 Bytes+  | 1 CPU clock (2-5 ns) | 3-10 GB/sec    |
| Cache         | 8 KB - 1 MB | 1-10 clocks          | 3-5 GB/sec     |
| Bus           |             | 10+ clocks           | 0.5 - 1 GB/sec |
| Main Memory   | 8 MB - 1 GB | 25 - 100 clocks      | 0.1 - 1 GB/sec |
| Interconnect  |             | 1-10 msec            | 1-20 MB/sec    |
| Backing Store | 8 GB+       | 5-10 msec            | 1-10 MB/sec    |
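The latency and bandwidth columns above combine into a rough transfer time: time = latency + size / bandwidth. The sketch below picks single points from the table's ranges purely for illustration (10 msec and 10 MB/sec for backing store; 50 clocks at an assumed 5 ns per clock and 0.5 GB/sec for main memory). The transfer sizes (a 4 KB page and a 64-byte block) and the function names are also assumptions.

```python
# Sketch: first-order transfer time = latency + size / bandwidth.
# All numeric inputs are illustrative points taken from the table's ranges.

def transfer_time_sec(size_bytes, latency_sec, bandwidth_bytes_per_sec):
    """Time to fetch size_bytes: fixed latency plus streaming time."""
    return latency_sec + size_bytes / bandwidth_bytes_per_sec


def latency_share(size_bytes, latency_sec, bandwidth_bytes_per_sec):
    """Fraction of the total transfer time spent waiting on latency."""
    total = transfer_time_sec(size_bytes, latency_sec, bandwidth_bytes_per_sec)
    return latency_sec / total


# Backing store: assumed 4 KB page, 10 msec latency, 10 MB/sec.
disk_page = transfer_time_sec(4096, 10e-3, 10e6)

# Main memory: assumed 64-byte block, 50 clocks at 5 ns each, 0.5 GB/sec.
memory_block = transfer_time_sec(64, 50 * 5e-9, 0.5e9)

print(f"disk page:    {disk_page * 1e3:.3f} msec")
print(f"memory block: {memory_block * 1e9:.0f} nsec")
print(f"disk latency share: {latency_share(4096, 10e-3, 10e6):.2f}")
```

With these inputs the disk access is dominated by latency, while the main memory access splits more evenly between latency and streaming time.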









# **Key Concepts**

### Latency

• Higher levels in hierarchy have lower latency because there are shorter and less complicated interconnections

### Bandwidth

• Generally higher levels in hierarchy have greater bandwidth because it's cheaper to run wider data paths with short "distances" and low fanout

### Concurrency

• Replication of resources can improve bandwidth for a cost

- Split caches
- Concurrent interconnection paths
- Multiple memory banks
- Multiple mass storage devices

### Balance

• Plumbing diagrams can give a first-order approximation of system balance for throughput
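A plumbing-diagram estimate can be sketched as finding the narrowest pipe along a data path. The example below is a first-order approximation only: it assumes all traffic passes through every stage (no cache hits filtering requests), and the stage bandwidths are representative values quoted earlier in these notes (3 GB/sec cache, 0.5 GB/sec bus, 125 MB/sec sustained main memory). The function name `bottleneck` is hypothetical.

```python
# Sketch: first-order "plumbing" estimate of sustained throughput.
# Assumption: every byte traverses every stage, so the narrowest stage
# bounds the throughput of the whole path. Bandwidths are in MB/sec.

def bottleneck(path):
    """Return the (name, bandwidth) stage with the lowest bandwidth."""
    if not path:
        raise ValueError("path must contain at least one stage")
    return min(path, key=lambda stage: stage[1])


memory_path = [
    ("cache", 3000),        # 3 GB/sec
    ("bus", 500),           # 0.5 GB/sec
    ("main memory", 125),   # 125 MB/sec sustained
]

name, bandwidth = bottleneck(memory_path)
print(f"bottleneck: {name} at {bandwidth} MB/sec")
```

Here the sustained main memory rate limits the path, so widening the bus alone would not improve throughput for traffic that misses the cache.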

## **Review**

### Principles of Locality

• Temporal, spatial, sequential

### Physical Memory Hierarchy:

- CPU registers -- smallest & fastest; measured in Bytes
- Cache -- almost as fast as CPU; measured in KB
- Bus/Interconnect -- bandwidth costs money
- Main Memory -- slow but large; measured in MB
- Mass storage -- slower and larger; measured in GB

### Bandwidth "Plumbing" diagrams

• Back-of-envelope calculations can demonstrate bottlenecks