This is a list of keywords related to the topics covered in the class, intended to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.
ISA, Trade-offs, Performance
ISA vs Microarchitecture
Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders)
Speculation (branch prediction overview)
Superscalar processing (multiple instruction issue, VLIW)
Prefetching (briefly covered)
Power/clock gating (briefly covered)
Performance
Addressing modes (Registers, Memory, Base-index-displacement)
0/1/2/3 Address machines (accumulator machines, stack machines)
Aligned accesses (Hardware handled vs Software handled: trade-offs)
Transactional memory (overview)
Von Neumann model (Sequential instruction execution)
Data flow machines (Data driven execution)
Evaluating performance
Pipelining
Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
Amdahl's law
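A quick refresher on the formula behind Amdahl's law (f = fraction of execution that is enhanced, S = speedup of that fraction):

```latex
\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}
\qquad \text{e.g. } f = 0.9,\; S = 10 \;\Rightarrow\; \frac{1}{0.1 + 0.09} \approx 5.26
```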
Pipelining vs Multi-cycle machines
Stalling (dependencies)
Data dependencies (true, output, anti)
Value prediction
Control dependencies
Branch prediction / Predication
Fine-grained multi-threading
Precise Exceptions
Need for separate instruction and data caches
Exceptions vs Interrupts
Reorder buffer/History buffer/Future files
Impact of size of architectural register file
Checkpointing
Reservation stations
Precise exceptions
Virtual Memory
Problems in early machines (size, protection, relocation, contiguity, sharing)
Segmentation (solves some of these problems)
Virtual memory (high level)
Implementing translations (page tables, inverted page tables)
TLBs (caching translations)
Virtually-tagged caches and Physically-tagged caches
Address space identifiers
Sharing (overview)
Super pages (overview)
Out-of-Order Execution
Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
Superscalar vs Out-of-Order execution
Precise exceptions (summary)
Branch prediction vs exceptions
Out-of-Order execution (preventing dispatch stalls)
Tomasulo's algorithm
Dynamic data-flow graph generation
Exploiting ILP
Overview of Tomasulo's algorithm
Problems with increasing instruction window to entire program
number of comparators
ROB size - instruction window
Implementation issues in out-of-order processing (decoupling structures)
physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
Register alias tables and architectural register alias table
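A minimal sketch of renaming with a register alias table (the class/structure names here are illustrative, not the exact structures from lecture): each destination gets a fresh physical register, which removes anti and output dependencies while true dependencies are preserved through the current mappings.

```python
# Minimal register-renaming sketch (hypothetical structure names):
# the frontend RAT maps each architectural register to the newest
# physical register holding its value.

class RenameStage:
    def __init__(self, num_arch, num_phys):
        self.rat = {r: r for r in range(num_arch)}    # arch -> phys mapping
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        # Sources read the current mappings (true dependencies preserved).
        phys_srcs = [self.rat[s] for s in srcs]
        # Destination gets a fresh physical register (false deps removed).
        phys_dst = self.free_list.pop(0)
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs

rs = RenameStage(num_arch=4, num_phys=8)
# r1 = r2 + r3 ; r2 = r1 + r1  (the second write gets its own phys reg,
# so it no longer conflicts with any earlier reader of r2)
d1, s1 = rs.rename(1, [2, 3])
d2, s2 = rs.rename(2, [1, 1])
```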
Handling branch misprediction (don't have to wait for the branch to become the oldest instruction)
checkpoint the register alias table on each branch
what if we have a lot of branches? (confidence estimation)
Handling stores
store buffers
load searches store buffer/cache
what if addresses are not yet computed for some stores? (unknown address problem)
how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
Problems with physical register files
register file read always on the critical path
Multiple instruction issue
Distributed/Centralized reservation stations
Scheduling
Design Issues
Dependents on a load miss
FIFO for load dependents
Register deallocation
Latency tolerance of out of order processors
Runahead and MLP
Lack of temporal and spatial locality (long strides, large working sets)
Stride prefetching (software vs hardware: complexity, dynamic information)
Irregular accesses (hash tables, linked structures)
Where OoO cannot really benefit (L2 cache misses; need large instruction windows)
Run-ahead execution (Generating cache misses that can be serviced in parallel)
Problems with run-ahead
length of run-ahead
what if new cache-miss is dependent on original miss?
Branch mispredictions and miss-dependent branches
DRAM bank organization
Tolerating memory latencies
Caching
Prefetching
Multi-threading
Out of order execution
Fine-grained multi-threading (design and costs)
Causes of inefficiency in Run-ahead (energy consumption)
Breaking dependence
address prediction (AVD prediction)
OOO wrap-up and Advanced Caching
Dual Core Execution (DCE)
Comparison between run ahead and DCE
Lag between the front and the back cores - controlled by result queue sizing
Slipstreaming
SMT Architectures for slipstreaming instead of 2 separate cores
Result queue length in DCE
Store-Load dependencies
Store buffer design
Content-addressable, age-ordered list of stores
Memory Disambiguation
Load dependence/independence on previous stores
Store/Load dependence prediction
Speculative execution and data coherence
Load buffer
Research issues in OoO
Scalable and energy-efficient instruction windows
Packing more MLP into a small window
OOO in Multi core systems
Memory system contention - bigger issue
Multiple cores to perform OOO
Asymmetric Multi-cores
Symmetric vs Asymmetric multi cores
Accelerating critical sections
Core fusion
Inclusion/Exclusion
Multi-level caching
Advanced Caching
Handling writes
Write-back
Write-through
Write allocate/no allocate
Instruction/Data Caching
Cache Replacement Policies
Random
FIFO
Least Recently Used
Not Most Recently Used
Least Frequently used
LRU vs Random - Random is as good as LRU for most practical workloads
Optimal Replacement Policy
MLP aware cache replacement
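A minimal sketch of the LRU policy listed above, for a single cache set (illustrative code, not a course artifact): a hit promotes the tag to the MRU position, a miss evicts the tag at the LRU position.

```python
# LRU replacement for one cache set: index 0 = LRU, last = MRU.

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = []

    def access(self, tag):
        if tag in self.tags:        # hit: promote to MRU position
            self.tags.remove(tag)
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)        # miss with full set: evict LRU tag
        self.tags.append(tag)
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# Only the second access to A hits; C then evicts B, and B evicts A.
```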
Cache Performance
Reducing miss rate
Reducing miss latency/cost
Cache Parameters
Cache size vs hit rate
Block size
Large Blocks - Critical words
Large blocks - bandwidth wastage
Sub blocking
Associativity
Power of 2 associativity?
Hybrid Replacement policies
Sampling based hybrid (random/LRU) replacement
Cache Misses
Compulsory
Conflict
Capacity
Coherence
Cache aware schedulers - cache affinity based application mapping
Victim caches
Hashing - Randomizing index functions
Pseudo associativity - serial cache lookup
More Caching
Speculative partial tag comparison
Skewed associative caches
Randomizing the index for different ways
Improving hit rate in software
Loop interchange - Row major, Column major
Blocking
Loop fusion, Array merging
Data structure layout - Packing frequently used fields in arrays
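The loop interchange idea above, sketched with Python lists standing in for a row-major 2-D array (purely illustrative; Python itself does not expose cache effects): iterating with the column index innermost gives unit-stride accesses, while the interchanged order strides by a whole row per access.

```python
# Loop interchange sketch for a row-major N x N array.
N = 4
a = [[i * N + j for j in range(N)] for i in range(N)]

# Poor spatial locality in row-major layout: stride-N accesses.
col_major_sum = 0
for j in range(N):
    for i in range(N):
        col_major_sum += a[i][j]

# After interchange: unit-stride accesses, same result.
row_major_sum = 0
for i in range(N):
    for j in range(N):
        row_major_sum += a[i][j]
```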
Handling multiple outstanding misses
Non blocking caches
Miss Status Handling Registers (MSHR)
Accessing MSHRs
Reducing miss latency through software
Compiler level reordering of loops
Software prefetching
Handling multiple accesses in a cycle
True/Virtual multiporting
Banking/Interleaving
Prefetching
Compulsory/conflict/capacity misses and prefetching
Coherence misses and prefetching
False sharing
Word/byte based coherence
Value prediction/Speculative execution
Prefetching and correctness
What/When/Where/How
Accuracy
Timeliness
Coverage
Prefetch buffers
Skewing prefetches towards demand fetches
Software/Hardware/Execution-based prefetchers
Software prefetching
Binding/Non-binding prefetches
Prefetching during pointer chasing
x86 prefetch instructions - prefetching into different levels of cache
Handling of prefetches that cause TLB misses/page faults
Compiler driven prefetching
Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
Hardware prefetching
Next line prefetchers
Stride prefetchers
Instruction based stride prefetchers
Stream buffers
Locality based prefetching
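The instruction-based stride prefetcher above can be sketched as a small PC-indexed table (illustrative structure, not a specific machine's design): each load PC tracks its last address and last stride, and a prefetch is issued only once the same stride repeats.

```python
# PC-indexed stride prefetcher sketch.

def stride_prefetcher(accesses):
    """accesses: list of (pc, addr); returns issued prefetch addresses."""
    table = {}            # pc -> (last_addr, last_stride)
    prefetches = []
    for pc, addr in accesses:
        if pc in table:
            last_addr, last_stride = table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetches.append(addr + stride)   # stride confirmed
            table[pc] = (addr, stride)
        else:
            table[pc] = (addr, 0)                  # first sighting: train only
    return prefetches

# A load at PC 0x40 walks addresses 100, 108, 116, 124 (stride 8);
# prefetches start once the stride has been seen twice.
pf = stride_prefetcher([(0x40, 100), (0x40, 108), (0x40, 116), (0x40, 124)])
```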
Prefetcher performance
Accuracy
Coverage
Timeliness
Aggressiveness
Prefetcher distance
Prefetcher degree
Irregular prefetching
Markov prefetchers
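A minimal sketch of the Markov prefetcher idea (simplified: one predicted successor per miss, unbounded table): a correlation table remembers which miss address followed each miss address, and on a repeat occurrence the recorded successor is prefetched. This captures repeating irregular patterns that stride prefetchers miss.

```python
from collections import defaultdict

def markov_prefetcher(miss_stream):
    successors = defaultdict(list)   # miss addr -> previously seen next misses
    prefetches = []
    prev = None
    for addr in miss_stream:
        if successors[addr]:
            # Predict the most recently recorded successor of this miss.
            prefetches.append(successors[addr][-1])
        if prev is not None and addr not in successors[prev]:
            successors[prev].append(addr)
        prev = addr
    return prefetches

# Irregular but repeating miss pattern A -> B -> C: the second pass
# through the pattern is fully predicted.
pf = markov_prefetcher(["A", "B", "C", "A", "B", "C"])
```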
Prefetching (wrap up)
POWER4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
Irregular patterns (indirect array accesses, linked structures)
Markov prefetching
linked lists or trees
Markov prefetchers vs stride prefetchers
Content directed prefetching
pointer based structures
identifying pointers (software mechanism/hardware prediction)
compiler analysis to provide hints for useful prefetches
Hybrid prefetchers
Execution based prefetchers
pre-execution thread for creating prefetches for the main program
determining when to start the pre-execution
similar idea for branch prediction
Prefetching in multicores
Importance of prefetch efficiency
Issues with local prefetcher throttling
Hierarchical prefetcher throttling
Cache coherence
Snoopy cache coherence
Shared caches in multicores
Utility based cache partitioning
Main memory system
Memory hierarchy
SRAM/DRAM cell structures
Memory bank organization
Page mode DRAM
Bank operation
Basic DRAM operation
Controller latency
Bank latency
DRAM chips/DIMMs
DRAM channels
Address mapping/interleaving
Bank mapping randomization
DRAM refresh
DRAM controller issues
Memory controller placement
DRAM controller functions
Refresh
Scheduling
DRAM Scheduling policies
FCFS
FR-FCFS
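A simplified one-shot sketch of FR-FCFS ordering (a real controller re-decides every cycle and the open row changes as requests are served): among the pending requests to a bank, row-buffer hits go first, and ties within each group are broken by arrival order.

```python
# FR-FCFS sketch: first ready (row hits), then first come, first served.

def fr_fcfs(requests, open_row):
    """requests: list of (arrival, row), already in arrival order."""
    hits = [r for r in requests if r[1] == open_row]       # row-buffer hits
    misses = [r for r in requests if r[1] != open_row]     # need activate
    return hits + misses    # each group is already in FCFS order

# Request 2 targets the open row, so it bypasses the older request 1.
order = fr_fcfs([(0, 5), (1, 9), (2, 5)], open_row=5)
```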
Row buffer management policies
Open row
Closed row
DRAM controller design
Machine learning - a possibility
DRAM power states
Inter-thread interference in DRAM
Multi-core DRAM controllers
Stall-time fairness
Unfairness
Estimating alone runtime
Providing system software support
Parallelism aware batch scheduling
Request batching
Within batch scheduling
Superscalar processing I
Types of parallelism
Task
Thread
Instruction
Fetch stage
Instruction alignment issue
Solution - Split cache line fetch
Fetch break - branches in the fetch block
Fetch break solutions
Short distance predicted-taken branch
Basic block reordering
Super block code optimization
Trace cache
Superscalar processing II
Superscalar processing III
Execute
enough execution units
enough forwarding paths (broadcast tag/value to all functional units)
Register file
need multiple reads/writes per cycle
Replicate or partition the register files
using block-structured ISA
Retirement
updating architectural register map
Control Flow
Problem of branches
Types
conditional, unconditional, call, return, indirect branches
Handling conditional branches
Predicate combining
Delayed branching
Fine-grained multi-threading
Branch prediction
predicting if an instruction is a branch (predecoding)
predicting the direction of the branch
predicting the target address of a branch
Static branch prediction
Dynamic branch prediction
last time predictor
history based predictors
two-level predictors
Control Flow - II
2-bit counter based prediction
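The 2-bit saturating counter can be sketched as follows (states 0..3; 2 and above predict taken): two consecutive mispredictions are needed to flip the prediction, so a single loop-exit branch does not retrain the predictor.

```python
# 2-bit saturating counter branch predictor for one branch.

def predict_2bit(outcomes, state=0):
    predictions = []
    for taken in outcomes:
        predictions.append(state >= 2)           # predict taken if counter >= 2
        if taken:
            state = min(state + 1, 3)            # saturate at strongly taken
        else:
            state = max(state - 1, 0)            # saturate at strongly not-taken
    return predictions

# Loop branch: taken, taken, taken, one not-taken exit, taken again.
# The single not-taken outcome does not flip the prediction.
p = predict_2bit([True, True, True, False, True])
```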
Global branch prediction
Global branch correlation
Global two-level prediction
Global history register
Local two-level prediction
Pattern history table
Interference in the pattern history table
Randomizing the index into the pattern history table
Agree prediction
Alpha 21264 Tournament Predictor
Perceptron branch predictor
Perceptron - learns a target boolean function of N inputs
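A minimal sketch of one perceptron entry (one branch's weight vector; table indexing and weight clipping omitted): the dot product of the weights with the global history bits (+1 taken, -1 not-taken) gives the prediction, and the weights train only on a misprediction or when the output magnitude is below a training threshold.

```python
# Perceptron branch predictor sketch for a single table entry.

def perceptron_predict_train(history, weights, bias, taken, threshold=4):
    y = bias + sum(w * h for w, h in zip(weights, history))
    prediction = y >= 0
    if prediction != taken or abs(y) <= threshold:
        t = 1 if taken else -1
        bias += t
        weights = [w + t * h for w, h in zip(weights, history)]
    return prediction, weights, bias

# Repeatedly see the same history with outcome "taken": the weights
# grow until |y| exceeds the threshold, then training stops.
w, b = [0, 0, 0], 0
hist = [1, -1, 1]            # global history, +1 = taken, -1 = not-taken
for _ in range(5):
    pred, w, b = perceptron_predict_train(hist, w, b, taken=True)
```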
Call and Return Prediction
Indirect branch prediction
Virtual Conditional Branch prediction
Branch prediction issues
Need to know a branch as soon as it is fetched
Latency
State recovery upon misprediction
Predicated execution
Control Flow - III & Concurrency
Predicated Execution
Predication decisions at the compiler
Rename stage modifications
Limitations of predication
Adaptivity
Complex Control Flow Graphs
ISA support
Wish branches
Wish jump/join
Wish loop
Wish branches vs Predicated Execution
Wish branches vs Branch prediction
Diverge-Merge Processor
Dynamic-Hammock
Multi-path Execution
Research issues in control flow handling
Hardware/software cooperation
Fetch gating
Recycling useful work done on wrong path
Concurrency
VLIW
Each VLIW instruction - a bundle of independent instructions (identified by compiler)
Each instruction bundle executed by hardware in lockstep
Commercial VLIW machines
TI C6000, Trimedia, STMicro
Intel IA-64 - Partially VLIW
Encoding VLIW NOPs
Static Instruction Scheduling for VLIW
Code motion - Safety & Legality
Trace scheduling
List scheduling
Super block scheduling
Hyperblock scheduling
The Intel IA-64 architecture
No lock step execution of a bundle
Specify dependencies between instructions within a bundle
Template bits
What hinders static code motion?
Exceptions
Loads/Stores