This is a list of keywords related to the topics covered in the class, intended to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.
ISA, Trade-offs, Performance
ISA vs Microarchitecture
Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders)
Speculation (branch prediction overview)
Superscalar processing (multiple instruction issue, VLIW)
Prefetching (briefly covered)
Power/clock gating (briefly covered)
Performance
Addressing modes (Registers, Memory, Base-index-displacement)
0/1/2/3 Address machines (accumulator machines, stack machines)
Aligned accesses (Hardware handled vs Software handled: trade-offs)
Transactional memory (overview)
Von Neumann model (Sequential instruction execution)
Data flow machines (Data driven execution)
Evaluating performance
Pipelining
Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
Amdahl's law
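A quick refresher on the formula behind Amdahl's law (f = fraction of execution that is enhanced, S = speedup of that fraction):

```latex
\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}
\qquad \text{e.g. } f = 0.9,\; S = 10 \;\Rightarrow\; \frac{1}{0.1 + 0.09} \approx 5.26
```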
Pipelining vs Multi-cycle machines
Stalling (dependencies)
Data dependencies (true, output, anti)
Value prediction
Control dependencies
Branch prediction / Predication
Fine-grained multi-threading
Precise Exceptions
Need for separate instruction and data caches
Exceptions vs Interrupts
Reorder buffer/History buffer/Future files
Impact of size of architectural register file
Checkpointing
Reservation stations
Precise exceptions
Virtual Memory
Problems in early machines (size, protection, relocation, contiguity, sharing)
Segmentation (solves some of these problems)
Virtual memory (high level)
Implementing translations (page tables, inverted page tables)
TLBs (caching translations)
Virtually-tagged caches and Physically-tagged caches
Address space identifiers
Sharing (overview)
Super pages (overview)
Out-of-Order Execution
Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
Superscalar vs Out-of-Order execution
Precise exceptions (summary)
Branch prediction vs exceptions
Out-of-Order execution (preventing dispatch stalls)
Tomasulo's algorithm
Dynamic data-flow graph generation
Exploiting ILP
Overview of Tomasulo's algorithm
Problems with increasing instruction window to entire program
number of comparators
ROB size - instruction window
Implementation issues in out-of-order processing (decoupling structures)
physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
Register alias tables and architectural register alias table
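A minimal sketch of renaming with a register alias table (the class/structure names here are illustrative, not the exact structures from lecture): each destination gets a fresh physical register, which removes anti and output dependencies while true dependencies are preserved through the current mappings.

```python
# Minimal register-renaming sketch (hypothetical structure names):
# the frontend RAT maps each architectural register to the newest
# physical register holding its value.

class RenameStage:
    def __init__(self, num_arch, num_phys):
        self.rat = {r: r for r in range(num_arch)}    # arch -> phys mapping
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        # Sources read the current mappings (true dependencies preserved).
        phys_srcs = [self.rat[s] for s in srcs]
        # Destination gets a fresh physical register (false deps removed).
        phys_dst = self.free_list.pop(0)
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs

rs = RenameStage(num_arch=4, num_phys=8)
# r1 = r2 + r3 ; r2 = r1 + r1  (the second write gets its own phys reg,
# so it no longer conflicts with any earlier reader of r2)
d1, s1 = rs.rename(1, [2, 3])
d2, s2 = rs.rename(2, [1, 1])
```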
Handling branch misprediction (don't have to wait for the branch to become the oldest instruction)
checkpoint the register alias table on each branch
what if we have a lot of branches? (confidence estimation)
Handling stores
store buffers
load searches store buffer/cache
what if addresses are not yet computed for some stores? (unknown address problem)
how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
Problems with physical register files
register file read always on the critical path
Multiple instruction issue
Distributed/Centralized reservation stations
Scheduling
Design Issues
Dependents on a load miss
FIFO for load dependents
Register deallocation
Latency tolerance of out of order processors
Runahead and MLP
Lack of temporal and spatial locality (long strides, large working sets)
Stride prefetching (software vs hardware: complexity, dynamic information)
Irregular accesses (hash tables, linked structures)
Where OoO cannot really benefit (L2 cache misses; need large instruction windows)
Run-ahead execution (Generating cache misses that can be serviced in parallel)
Problems with run-ahead
length of run-ahead
what if new cache-miss is dependent on original miss?
Branch mispredictions and miss-dependent branches
DRAM bank organization
Tolerating memory latencies
Caching
Prefetching
Multi-threading
Out of order execution
Fine-grained multi-threading (design and costs)
Causes of inefficiency in Run-ahead (energy consumption)
Breaking dependence
address prediction (AVD prediction)
OOO wrap-up and Advanced Caching
Dual Core Execution (DCE)
Comparison between run ahead and DCE
Lag between the front and the back cores - controlled by result queue sizing
Slipstreaming
SMT Architectures for slipstreaming instead of 2 separate cores
Result queue length in DCE
Store-Load dependencies
Store buffer design
Content-addressable, age-ordered list of stores
Memory Disambiguation
Load dependence/independence on previous stores
Store/Load dependence prediction
Speculative execution and data coherence
Load buffer
Research issues in OoO
Scalable and energy-efficient instruction windows
Packing more MLP into a small window
OOO in Multi core systems
Memory system contention - bigger issue
Multiple cores to perform OOO
Asymmetric Multi-cores
Symmetric vs Asymmetric multi cores
Accelerating critical sections
Core fusion
Inclusion/Exclusion
Multi-level caching
Advanced Caching
Handling writes
Write-back
Write-through
Write allocate/no allocate
Instruction/Data Caching
Cache Replacement Policies
Random
FIFO
Least Recently Used
Not Most Recently Used
Least Frequently used
LRU vs Random - Random is as good as LRU for most practical workloads
Optimal Replacement Policy
MLP aware cache replacement
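A minimal sketch of the LRU policy listed above, for a single cache set (illustrative code, not a course artifact): a hit promotes the tag to the MRU position, a miss evicts the tag at the LRU position.

```python
# LRU replacement for one cache set: index 0 = LRU, last = MRU.

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = []

    def access(self, tag):
        if tag in self.tags:        # hit: promote to MRU position
            self.tags.remove(tag)
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)        # miss with full set: evict LRU tag
        self.tags.append(tag)
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# Only the second access to A hits; C then evicts B, and B evicts A.
```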
Cache Performance
Reducing miss rate
Reducing miss latency/cost
Cache Parameters
Cache size vs hit rate
Block size
Large Blocks - Critical words
Large blocks - bandwidth wastage
Sub blocking
Associativity
Power of 2 associativity?
Hybrid Replacement policies
Sampling based hybrid (random/LRU) replacement
Cache Misses
Compulsory
Conflict
Capacity
Coherence
Cache aware schedulers - cache affinity based application mapping
Victim caches
Hashing - Randomizing index functions
Pseudo associativity - serial cache lookup
More Caching
Speculative partial tag comparison
Skewed associative caches
Randomizing the index for different ways
Improving hit rate in software
Loop interchange - Row major, Column major
Blocking
Loop fusion, Array merging
Data structure layout - Packing frequently used fields in arrays
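The loop interchange idea above, sketched with Python lists standing in for a row-major 2-D array (purely illustrative; Python itself does not expose cache effects): iterating with the column index innermost gives unit-stride accesses, while the interchanged order strides by a whole row per access.

```python
# Loop interchange sketch for a row-major N x N array.
N = 4
a = [[i * N + j for j in range(N)] for i in range(N)]

# Poor spatial locality in row-major layout: stride-N accesses.
col_major_sum = 0
for j in range(N):
    for i in range(N):
        col_major_sum += a[i][j]

# After interchange: unit-stride accesses, same result.
row_major_sum = 0
for i in range(N):
    for j in range(N):
        row_major_sum += a[i][j]
```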
Handling multiple outstanding misses
Non blocking caches
Miss Status Handling Registers (MSHR)
Accessing MSHRs
Reducing miss latency through software
Compiler level reordering of loops
Software prefetching
Handling multiple accesses in a cycle
True/Virtual multiporting
Banking/Interleaving
Prefetching
Compulsory/conflict/capacity misses and prefetching
Coherence misses and prefetching
False sharing
Word/byte based coherence
Value prediction/Speculative execution
Prefetching and correctness
What/When/Where/How
Accuracy
Timeliness
Coverage
Prefetch buffers
Skewing prefetches towards demand fetches
Software/Hardware/Execution-based prefetchers
Software prefetching
Binding/Non-binding prefetches
Prefetching during pointer chasing
x86 prefetch instructions - prefetching into different levels of cache
Handling of prefetches that cause TLB misses/page faults
Compiler driven prefetching
Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
Hardware prefetching
Next line prefetchers
Stride prefetchers
Instruction based stride prefetchers
Stream buffers
Locality based prefetching
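The instruction-based stride prefetcher above can be sketched as a small PC-indexed table (illustrative structure, not a specific machine's design): each load PC tracks its last address and last stride, and a prefetch is issued only once the same stride repeats.

```python
# PC-indexed stride prefetcher sketch.

def stride_prefetcher(accesses):
    """accesses: list of (pc, addr); returns issued prefetch addresses."""
    table = {}            # pc -> (last_addr, last_stride)
    prefetches = []
    for pc, addr in accesses:
        if pc in table:
            last_addr, last_stride = table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetches.append(addr + stride)   # stride confirmed
            table[pc] = (addr, stride)
        else:
            table[pc] = (addr, 0)                  # first sighting: train only
    return prefetches

# A load at PC 0x40 walks addresses 100, 108, 116, 124 (stride 8);
# prefetches start once the stride has been seen twice.
pf = stride_prefetcher([(0x40, 100), (0x40, 108), (0x40, 116), (0x40, 124)])
```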
Prefetcher performance
Accuracy
Coverage
Timeliness
Aggressiveness
Prefetcher distance
Prefetcher degree
Irregular prefetching
Markov prefetchers
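A minimal sketch of the Markov prefetcher idea (simplified: one predicted successor per miss, unbounded table): a correlation table remembers which miss address followed each miss address, and on a repeat occurrence the recorded successor is prefetched. This captures repeating irregular patterns that stride prefetchers miss.

```python
from collections import defaultdict

def markov_prefetcher(miss_stream):
    successors = defaultdict(list)   # miss addr -> previously seen next misses
    prefetches = []
    prev = None
    for addr in miss_stream:
        if successors[addr]:
            # Predict the most recently recorded successor of this miss.
            prefetches.append(successors[addr][-1])
        if prev is not None and addr not in successors[prev]:
            successors[prev].append(addr)
        prev = addr
    return prefetches

# Irregular but repeating miss pattern A -> B -> C: the second pass
# through the pattern is fully predicted.
pf = markov_prefetcher(["A", "B", "C", "A", "B", "C"])
```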
Prefetching (wrap up)
POWER4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
Irregular patterns (indirect array accesses, linked structures)
Markov prefetching
linked lists or trees
Markov prefetchers vs stride prefetchers
Content directed prefetching
pointer based structures
identifying pointers (software mechanism/hardware prediction)
compiler analysis to provide hints for useful prefetches
Hybrid prefetchers
Execution based prefetchers
pre-execution thread for creating prefetches for the main program
determining when to start the pre-execution
similar idea for branch prediction
Prefetching in multicores
Importance of prefetch efficiency
Issues with local prefetcher throttling
Hierarchical prefetcher throttling
Cache coherence
Snoopy cache coherence
Shared caches in multicores
Utility based cache partitioning
Main memory system
Memory hierarchy
SRAM/DRAM cell structures
Memory bank organization
Page mode DRAM
Bank operation
Basic DRAM operation
Controller latency
Bank latency
DRAM chips/DIMMs
DRAM channels
Address mapping/interleaving
Bank mapping randomization
DRAM refresh
DRAM controller issues
Memory controller placement
DRAM controller functions
Refresh
Scheduling
DRAM Scheduling policies
FCFS
FR-FCFS
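A simplified one-shot sketch of FR-FCFS ordering (a real controller re-decides every cycle and the open row changes as requests are served): among the pending requests to a bank, row-buffer hits go first, and ties within each group are broken by arrival order.

```python
# FR-FCFS sketch: first ready (row hits), then first come, first served.

def fr_fcfs(requests, open_row):
    """requests: list of (arrival, row), already in arrival order."""
    hits = [r for r in requests if r[1] == open_row]       # row-buffer hits
    misses = [r for r in requests if r[1] != open_row]     # need activate
    return hits + misses    # each group is already in FCFS order

# Request 2 targets the open row, so it bypasses the older request 1.
order = fr_fcfs([(0, 5), (1, 9), (2, 5)], open_row=5)
```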
Row buffer management policies
Open row
Closed row
DRAM controller design
Machine learning - a possibility
DRAM power states
Inter-thread interference in DRAM
Multi-core DRAM controllers
Stall-time fairness
Unfairness
Estimating alone runtime
Providing system software support
Parallelism aware batch scheduling
Request batching
Within batch scheduling
Superscalar processing I
Types of parallelism
Task
Thread
Instruction
Fetch stage
Instruction alignment issue
Solution - Split cache line fetch
Fetch break - branches in the fetch block
Fetch break solutions
Short distance predicted-taken branch
Basic block reordering
Super block code optimization
Trace cache
Superscalar processing II
Superscalar processing III
Execute
enough execution units
enough forwarding paths (broadcast tag/value to all functional units)
Register file
need multiple reads/writes per cycle
Replicate or partition the register files
using block-structured ISA
Retirement
updating architectural register map
Control Flow
Problem of branches
Types
conditional, unconditional, call, return, indirect branches
Handling conditional branches
Predicate combining
Delayed branching
Fine-grained multi-threading
Branch prediction
predicting if an instruction is a branch (predecoding)
predicting the direction of the branch
predicting the target address of a branch
Static branch prediction
Dynamic branch prediction
last time predictor
history based predictors
two-level predictors
Control Flow - II
2-bit counter based prediction
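The 2-bit saturating counter can be sketched as follows (states 0..3; 2 and above predict taken): two consecutive mispredictions are needed to flip the prediction, so a single loop-exit branch does not retrain the predictor.

```python
# 2-bit saturating counter branch predictor for one branch.

def predict_2bit(outcomes, state=0):
    predictions = []
    for taken in outcomes:
        predictions.append(state >= 2)           # predict taken if counter >= 2
        if taken:
            state = min(state + 1, 3)            # saturate at strongly taken
        else:
            state = max(state - 1, 0)            # saturate at strongly not-taken
    return predictions

# Loop branch: taken, taken, taken, one not-taken exit, taken again.
# The single not-taken outcome does not flip the prediction.
p = predict_2bit([True, True, True, False, True])
```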
Global branch prediction
Global branch correlation
Global two-level prediction
Global history register
Local two-level prediction
Pattern history table
Interference in the pattern history table
Randomizing the index into the pattern history table
Agree prediction
Alpha 21264 Tournament Predictor
Perceptron branch predictor
Perceptron - learns a target boolean function of N inputs
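A minimal sketch of one perceptron entry (one branch's weight vector; table indexing and weight clipping omitted): the dot product of the weights with the global history bits (+1 taken, -1 not-taken) gives the prediction, and the weights train only on a misprediction or when the output magnitude is below a training threshold.

```python
# Perceptron branch predictor sketch for a single table entry.

def perceptron_predict_train(history, weights, bias, taken, threshold=4):
    y = bias + sum(w * h for w, h in zip(weights, history))
    prediction = y >= 0
    if prediction != taken or abs(y) <= threshold:
        t = 1 if taken else -1
        bias += t
        weights = [w + t * h for w, h in zip(weights, history)]
    return prediction, weights, bias

# Repeatedly see the same history with outcome "taken": the weights
# grow until |y| exceeds the threshold, then training stops.
w, b = [0, 0, 0], 0
hist = [1, -1, 1]            # global history, +1 = taken, -1 = not-taken
for _ in range(5):
    pred, w, b = perceptron_predict_train(hist, w, b, taken=True)
```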
Call and Return Prediction
Indirect branch prediction
Virtual Conditional Branch prediction
Branch prediction issues
Need to know a branch as soon as it is fetched
Latency
State recovery upon misprediction
Predicated execution
Control Flow - III & Concurrency
Predicated Execution
Predication decisions at the compiler
Rename stage modifications
Limitations of predication
Adaptivity
Complex Control Flow Graphs
ISA support
Wish branches
Wish jump/join
Wish loop
Wish branches vs Predicated Execution
Wish branches vs Branch prediction
Diverge-Merge Processor
Dynamic-Hammock
Multi-path Execution
Research issues in control flow handling
Hardware/software cooperation
Fetch gating
Recycling useful work done on wrong path
Concurrency
VLIW
Each VLIW instruction - a bundle of independent instructions (identified by compiler)
Each instruction bundle executed by hardware in lockstep
Commercial VLIW machines
TI C6000, Trimedia, STMicro
Intel IA-64 - Partially VLIW
Encoding VLIW NOPs
Static Instruction Scheduling for VLIW
Code motion - Safety & Legality
Trace scheduling
List scheduling
Super block scheduling
Hyperblock scheduling
The Intel IA-64 architecture
No lock step execution of a bundle
Specify dependencies between instructions within a bundle
Template bits
What hinders static code motion?
Exceptions
Loads/Stores