===== Buzz Words =====
This is a list of keywords related to the topics covered in class, meant to jog your memory. For comprehensive coverage, please refer to the lecture notes and the notes you take during class.

===== Lecture 2 =====
  ISA, Trade-offs, Performance

  * ISA vs Microarchitecture
  * Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders)
  * Speculation (branch prediction overview)
  * Superscalar processing (multiple instruction issue, VLIW)
  * Prefetching (briefly covered)
  * Power/clock gating (briefly covered)

===== Lecture 3 =====
  Performance

  * Addressing modes (Registers, Memory, Base-index-displacement)
  * 0/1/2/3 Address machines (accumulator machines, stack machines)
  * Aligned accesses (Hardware handled vs Software handled: trade-offs)
  * Transactional memory (overview)
  * Von Neumann model (Sequential instruction execution)
  * Data flow machines (Data driven execution)
  * Evaluating performance

===== Lecture 4 =====
  Pipelining

  * Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
  * Amdahl's law
  * Pipelining vs Multi-cycle machines
  * Stalling (dependencies)
  * Data dependencies (true, output, anti)
  * Value prediction
  * Control dependencies 
  * Branch prediction / Predication
  * Fine-grained multi-threading
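
As a memory aid for Amdahl's law from the list above, here is a minimal sketch. The function name and parameter names are illustrative choices, not taken from the lecture.

```python
def amdahl_speedup(enhanced_fraction: float, enhancement_factor: float) -> float:
    """Overall speedup when a fraction of execution time is sped up.

    enhanced_fraction:  share of the original execution time that benefits (0..1)
    enhancement_factor: how much faster that share runs (> 0)
    """
    if not 0.0 <= enhanced_fraction <= 1.0:
        raise ValueError("enhanced_fraction must be between 0 and 1")
    if enhancement_factor <= 0:
        raise ValueError("enhancement_factor must be positive")
    # The unenhanced part still takes its original share of the time.
    remaining = 1.0 - enhanced_fraction
    return 1.0 / (remaining + enhanced_fraction / enhancement_factor)


if __name__ == "__main__":
    # Even a huge speedup of 90% of the work is bounded by the other 10%.
    print(amdahl_speedup(0.9, 10))
    print(amdahl_speedup(0.9, 1_000_000))
```

The second call shows the usual takeaway: the part that is not improved limits the overall speedup.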

===== Lecture 5 =====
  Precise Exceptions

  * Need for separate instruction and data caches
  * Exceptions vs Interrupts
  * Reorder buffer/History buffer/Future files
  * Impact of size of architectural register file
  * Checkpointing
  * Reservation stations
  * Precise exceptions

===== Lecture 6 =====
  Virtual Memory

  * Problems in early machines (size, protection, relocation, contiguity, sharing)
  * Segmentation (solves some of these problems)
  * Virtual memory (high level)
  * Implementing translations (page tables, inverted page tables)
  * TLBs (caching translations)
  * Virtually-tagged caches and Physically-tagged caches
  * Address space identifiers
  * Sharing (overview)
  * Super pages (overview)
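
The translation keywords above (page tables, TLBs) can be tied together with a small sketch. It assumes 4 KiB pages and a flat, single-level page table stored as a dictionary; ''translate'', ''PageFault'', and the dictionary-based TLB are hypothetical names for illustration, not a description of any real MMU.

```python
PAGE_SIZE = 4096                          # assumed 4 KiB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 offset bits


class PageFault(Exception):
    """Raised when no translation exists for a virtual page."""


def translate(vaddr: int, page_table: dict, tlb: dict):
    """Return (physical_address, tlb_hit) for a virtual address.

    page_table maps virtual page number -> physical frame number.
    tlb is a small cache of the same mapping, filled on a TLB miss.
    """
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn in tlb:                        # TLB hit: no page table access
        return (tlb[vpn] << OFFSET_BITS) | offset, True
    if vpn not in page_table:             # no mapping at all
        raise PageFault(hex(vaddr))
    tlb[vpn] = page_table[vpn]            # cache the translation
    return (tlb[vpn] << OFFSET_BITS) | offset, False


if __name__ == "__main__":
    page_table, tlb = {0x1: 0x80}, {}
    print(translate(0x1234, page_table, tlb))
    print(translate(0x1238, page_table, tlb))
```

The page offset passes through unchanged; only the virtual page number is translated, which is why the TLB caches per-page entries.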

===== Lecture 7 =====
  Out-of-Order Execution

  * Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
  * Super scalar vs Out-of-Order execution
  * Precise exceptions (summary)
  * Branch prediction vs exceptions
  * Out-of-Order execution (preventing dispatch stalls)
  * Tomasulo's algorithm
  * Dynamic data-flow graph generation


===== Lecture 8 =====
  Exploiting ILP

  * Overview of Tomasulo's algorithm
  * Problems with increasing instruction window to entire program 
    - number of comparators
    - ROB size - instruction window
  * Implementation issues in out-of-order processing (decoupling structures)
    - physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
    - Register alias tables and architectural register alias table
  * Handling branch misprediction (don't have to wait for branch to become oldest)
    - checkpoint the register alias table on each branch
    - what if we have a lot of branches? (confidence estimation)
  * Handling stores
    - store buffers 
    - load searches store buffer/cache
    - what if addresses are not yet computed for some stores? (unknown address problem)
    - how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
  * Problems with physical register files
    - register file read always on the critical path
  * Multiple instruction issue
  * Distributed/Centralized reservation stations
  * Scheduling
    - Design Issues
  * Dependents on a load miss
    - FIFO for load dependents
    - Register deallocation
  * Latency tolerance of out of order processors

===== Lecture 9 =====
  Caching Basics

  * Caches
  * Direct mapped caches
  * Set associative caches
  * Miss rate
  * Average Memory Access Time
  * Cache placement
  * Cache replacement
  * Handling writes in caches
  * Inclusion/Exclusion
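
For the miss rate and Average Memory Access Time keywords above, a minimal sketch of the usual formula (hit time plus miss rate times miss penalty). The function names and the two-level variant are illustrative assumptions; latencies are in cycles.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_local_miss_rate: float,
                   memory_latency: float) -> float:
    """AMAT with an L2: the L1 miss penalty is itself the AMAT of the L2."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_latency)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


if __name__ == "__main__":
    print(amat(1, 0.1, 100))
    print(amat_two_level(1, 0.1, 10, 0.5, 100))
```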

===== Lecture 10 =====
  Runahead and MLP

  * Lack of temporal and spatial locality (long strides, large working sets)
  * Stride prefetching (software vs hardware: complexity, dynamic information)
  * Irregular accesses (hash tables, linked structures?)
  * Where does OoO not really benefit? (L2 cache misses) (need large instruction windows)
  * Run-ahead execution (Generating cache misses that can be serviced in parallel)
  * Problems with run-ahead
    - length of run-ahead
    - what if new cache-miss is dependent on original miss?
    - Branch mispredictions and miss dependent branches
  * DRAM bank organization
  * Tolerating memory latencies 
    - Caching
    - Prefetching
    - Multi-threading
    - Out of order execution
  * Fine-grained multi-threading (design and costs)
  * Causes of inefficiency in Run-ahead (energy consumption)
  * Breaking dependence
    - address prediction (AVD prediction)

===== Lecture 11 =====
  OOO wrap-up and Advanced Caching

  * Dual Core Execution (DCE)
  * Comparison between run ahead and DCE
    - Lag between the front and the back cores - controlled by result queue sizing
  * Slipstreaming
  * SMT Architectures for slipstreaming instead of 2 separate cores
  * Result queue length in DCE
  * Store-Load dependencies
  * Store buffer design
    - Content associative, age ordered list of stores
  * Memory Disambiguation
    - Load dependence/independence on previous stores
    - Store/Load dependence prediction
  * Speculative execution and data coherence
    - Load buffer 
  * Research issues in OoO
    - Scalable and energy-efficient instruction windows
    - Packing more MLP into a small window
  * OOO in Multi core systems
    - Memory system contention - bigger issue
    - Multiple cores to perform OOO
    - Asymmetric Multi-cores
  * Symmetric vs Asymmetric multi cores
    - Accelerating critical sections
    - Core fusion
  
  * Inclusion/Exclusion
  * Multi-level caching

===== Lecture 12 =====
  Advanced Caching

  * Handling writes
    - Write-back
    - Write-through
    - Write allocate/no allocate
  * Instruction/Data Caching
  * Cache Replacement Policies
    - Random
    - FIFO
    - Least Recently Used
    - Not Most Recently Used
    - Least Frequently used
  * LRU vs Random - Random as good as LRU for most practical workloads
  * Optimal Replacement Policy
  * MLP aware cache replacement
  * Cache Performance
    - Reducing miss rate
    - Reducing miss latency/cost
  * Cache Parameters
    - Cache size vs hit rate
    - Block size
    - Large Blocks - Critical words
    - Large blocks - bandwidth wastage
    - Sub blocking
    - Associativity
    - Power of 2 associativity?
    - Hybrid Replacement policies
    - Sampling based hybrid (random/LRU) replacement
  * Cache Misses
    - Compulsory
    - Conflict
    - Capacity
    - Coherence
  * Cache aware schedulers - cache affinity based application mapping
  * Victim caches
  * Hashing - Randomizing index functions
  * Pseudo associativity - serial cache lookup
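
To make the associativity, conflict-miss, and LRU replacement keywords above concrete, here is a small cache model. It is a sketch under simple assumptions (modulo indexing, true LRU per set, no write handling); the class name and parameters are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeLRUCache:
    """Toy set-associative cache with true LRU replacement per set."""

    def __init__(self, num_sets: int, ways: int, block_size: int):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One ordered dict per set: oldest (LRU) tag first, newest (MRU) last.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Access a byte address; return True on a hit, False on a miss."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # promote to MRU
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)      # evict the LRU line
        lines[tag] = True
        return False


if __name__ == "__main__":
    trace = [0, 128, 0, 128]               # two blocks mapping to the same set
    direct = SetAssociativeLRUCache(num_sets=2, ways=1, block_size=64)
    two_way = SetAssociativeLRUCache(num_sets=1, ways=2, block_size=64)
    print([direct.access(a) for a in trace])
    print([two_way.access(a) for a in trace])
```

Both configurations hold two blocks in total. With one way per set the two addresses keep evicting each other (conflict misses), while the two-way configuration keeps both resident.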

===== Lecture 13 =====
  More Caching

  * Speculative partial tag comparison
  * Skewed associative caches
    - Randomizing the index for different ways
  * Improving hit rate in software
    - Loop interchange - Row major, Column major
    - Blocking
    - Loop fusion, Array merging
    - Data structure layout - Packing frequently used fields in arrays
  * Handling multiple outstanding misses
    - Non blocking caches
    - Miss Status Handling Registers (MSHR)
    - Accessing MSHRs
  * Reducing miss latency through software
    - Compiler level reordering of loops
    - Software prefetching
  * Handling multiple accesses in a cycle
    - True/Virtual multiporting
    - Banking/Interleaving 
   

===== Lecture 14 =====
  Prefetching

  * Compulsory/conflict/capacity misses and prefetching
  * Coherence misses and prefetching
  * False sharing
    - Word/byte based coherence
    - Value prediction/Speculative execution
  * Prefetching and correctness
  * What/When/Where/How
    - Accuracy
    - Timeliness
    - Coverage
    - Prefetch buffers
    - Skewing prefetches towards demand fetches
    - Software/Hardware/Execution-based prefetchers
  * Software prefetching
    - Binding/Non-binding prefetches
    - Prefetching during pointer chasing
    - x86 prefetch instructions - prefetching into different levels of cache
    - Handling of prefetches that cause TLB misses/page faults
    - Compiler driven prefetching
    - Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
  * Hardware prefetching
    - Next line prefetchers
    - Stride prefetchers
    - Instruction based stride prefetchers
    - Stream buffers
    - Locality based prefetching
  * Prefetcher performance
    - Accuracy
    - Coverage
    - Timeliness
    - Aggressiveness
      - Prefetcher distance
      - Prefetcher degree
  * Irregular prefetching
    - Markov prefetchers
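
The instruction based stride prefetcher, prefetcher distance, and prefetcher degree keywords above can be sketched as follows. This is a simplified model, assuming an unbounded table indexed by load PC and a "same stride seen twice" confidence rule; real designs use finite tables and richer confidence counters. All names are hypothetical.

```python
class StridePrefetcher:
    """Toy PC-indexed stride prefetcher."""

    def __init__(self, distance: int = 1, degree: int = 1):
        self.distance = distance   # how many strides ahead the first prefetch is
        self.degree = degree       # how many prefetches are issued per trigger
        self.table = {}            # load PC -> (last address, last stride)

    def observe(self, pc: int, addr: int) -> list:
        """Record a load and return the addresses to prefetch (may be empty)."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Trigger only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * (self.distance + i)
                              for i in range(self.degree)]
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetches


if __name__ == "__main__":
    pf = StridePrefetcher(distance=2, degree=2)
    for address in (100, 164, 228, 292):
        print(address, pf.observe(0x400, address))
```

Raising the distance improves timeliness at the risk of running past the end of the stream; raising the degree improves coverage at the cost of bandwidth and possibly accuracy.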

===== Lecture 15 =====
  Prefetching (wrap up)

  * POWER4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
  * Irregular patterns (indirect array accesses, linked structures)
    - Markov prefetching
      - linked lists or trees
      - Markov prefetchers vs stride prefetchers
  * Content directed prefetching
    - pointer based structures
    - identifying pointers (software mechanism/hardware prediction)
    - compiler analysis to provide hints for useful prefetches
  * Hybrid prefetchers
  * Execution based prefetchers
    - pre-execution thread for creating prefetches for the main program
    - determining when to start the pre-execution
    - similar idea for branch prediction

===== Lecture 16 =====
  * Prefetching in multicores
  * Importance of prefetch efficiency
  * Issues with local prefetcher throttling
  * Hierarchical prefetcher throttling
  * Cache coherence
    - Snoopy cache coherence
  * Shared caches in multicores
  * Utility based cache partitioning

===== Lecture 17 =====

  * Software based cache management
    - Thread scheduling
    - Page coloring
    - Dynamic partitioning through page recoloring
  * Cache placement
    - Insertion 
    - Re-insertion
    - Circular reference model
    - Dynamic insertion policy - LRU and Bimodal insertion

===== Lecture 19 =====
  Main memory system

  * Memory hierarchy
  * SRAM/DRAM cell structures
  * Memory bank organization
  * Page mode DRAM
    - Bank operation
  * Basic DRAM operation
    - Controller latency
    - Bank latency
  * DRAM chips/DIMMs
  * DRAM channels
  * Address mapping/interleaving
  * Bank mapping randomization
  * DRAM refresh
  * DRAM controller issues
    - Memory controller placement

===== Lecture 20 =====

  * DRAM controller functions
    - Refresh
    - Scheduling
  * DRAM Scheduling policies
    - FCFS
    - FR-FCFS
  * Row buffer management policies
    - Open row
    - Closed row
  * DRAM controller design
    - Machine learning - a possibility 
  * DRAM power states
  * Inter-thread interference in DRAM
  * Multi-core DRAM controllers
    - Stall-time fairness
      - Unfairness
      - Estimating alone runtime 
      - Providing system software support
    - Parallelism aware batch scheduling
      - Request batching
      - Within batch scheduling
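
As a reminder of how FCFS and FR-FCFS differ, here is a sketch of the request selection step only. It assumes each request carries an arrival time, bank, and row, and that ''open_rows'' records the row currently held in each bank's row buffer; bank timing, readiness, and read/write distinctions are deliberately ignored. The names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival: int   # arrival time, smaller means older
    bank: int
    row: int


def fcfs_pick(queue):
    """FCFS: always service the oldest request."""
    return min(queue, key=lambda r: r.arrival) if queue else None


def fr_fcfs_pick(queue, open_rows):
    """FR-FCFS: row-buffer hits first, then oldest first.

    open_rows maps bank -> currently open row (banks with no open row are absent).
    """
    if not queue:
        return None
    # False sorts before True, so row hits win; ties are broken by age.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


if __name__ == "__main__":
    queue = [Request(0, 0, 5), Request(1, 0, 7), Request(2, 1, 3)]
    open_rows = {0: 7}
    print(fcfs_pick(queue))
    print(fr_fcfs_pick(queue, open_rows))
```

Prioritizing row hits improves DRAM throughput, but it also favors threads with high row-buffer locality, which is one source of the unfairness listed above.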

===== Lecture 21 =====
  Super scalar processing I

  * Types of parallelism
    - Task
    - Thread
    - Instruction

  * Fetch stage
    - Instruction alignment Issue
    - Solution - Split cache line fetch
    - Fetch break - branches in the fetch block

  * Fetch break solutions
    - Short distance predicted-taken branch
    - Basic block reordering
    - Super block code optimization
    - Trace cache

===== Lecture 22 =====
  Super scalar processing II

  * Trace Caches
    - Multiple branch prediction aliasing issue
    - Inactive issue
    - Promoting highly biased branches to static branches
    - Saturating counters
    - Fill unit optimizations
      - Making highly biased branch paths atomic
    - Redundancy - Solution : Block based trace cache
    - An Enhanced Instruction cache vs a trace cache
    - Pentium 4 trace cache
  * Block structured ISA
    - Enlarged block branches - faults
    - Super block vs Block structured ISAs
  * Decode in superscalar processing
    - Predecoding
    - Decode cache
    - CISC to RISC translation in hardware
    - Micro code sequencing
    - Pentium 4 decoders
    - Pentium Pro decoders
      - simple decoders
      - Complex decoder
      - Micro op sequencer
    - Instruction buffering fetch and decode

===== Lecture 23 =====
  Superscalar Processing III

  * Renaming multiple instructions
    - dependency check logic (n^2 comparators)
    - help from compiler
      * ensure instructions are independent (difficult for wide fetches)
      * hardware-software co-design to simplify dependency logic

  * Dispatching multiple instructions
    - wakeup logic (compare all tags in reservation station with all the tags that are broadcast)
    - select logic (hierarchical tree based selection)

  * Execute
    - enough execution units
    - enough forwarding paths (broadcast tag/value to all functional units)

  * Reducing dispatch+bypass delays
    - clustering (divide window into multiple clusters)
    - intra-cluster bypass is fast
    - inter-cluster bypass can be slow

  * Register file
    - need multiple reads/writes per cycle
    - Replicate or partition the register files
    - using block-structured ISA

  * Retirement
    - updating architectural register map
    

===== Lecture 24 =====
  Control Flow

  * Problem of branches
  * Types
    * conditional, unconditional, call, return, indirect branches
  * Handling conditional branches
  * Predicate combining
    * condition codes vs condition registers
  * Delayed branching
  * Fine-grained multi-threading
  * Branch prediction
    * predicting if an instruction is a branch (predecoding)
    * predicting the direction of the branch
    * predicting the target address of a branch
  * Static branch prediction
    * always taken/not taken
    * backward taken, forward not taken
    * by compiler based on profiling
  * Dynamic branch prediction
    * last time predictor
    * history based predictors
    * two-level predictors

===== Lecture 25 =====
  Control Flow - II

  * 2-bit counter based prediction
  * Global branch prediction
  * Global branch correlation
  * Global two-level prediction
    - Global history register
  * Local two-level prediction
    - Pattern history table
    - Interference in the pattern history table
      - Randomizing the index into the pattern history table
      - Agree prediction
  * Alpha 21264 Tournament Predictor
  * Perceptron branch predictor
    - Perceptron - learns a target boolean function of N inputs
  * Call and Return Prediction
  * Indirect branch prediction
    - Virtual Conditional Branch prediction
  * Branch prediction issues
    - Need to know a branch as soon as it is fetched
    - Latency
    - State recovery upon misprediction
  * Predicated execution
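
The 2-bit counter based prediction keyword above can be sketched as a table of saturating counters. This assumes a simple PC-modulo index and a weakly-not-taken initial state; the class name and table size are illustrative choices.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch PC.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, table_size: int = 1024):
        self.table_size = table_size
        self.counters = [1] * table_size   # start weakly not taken

    def _index(self, pc: int) -> int:
        return pc % self.table_size

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)   # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)   # saturate at 0


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    mispredictions = 0
    # A loop branch: taken 9 times, then not taken once, repeated 3 times.
    for outcome in ([True] * 9 + [False]) * 3:
        if predictor.predict(0x40) != outcome:
            mispredictions += 1
        predictor.update(0x40, outcome)
    print(mispredictions)
```

The hysteresis is the point: a single not-taken outcome at loop exit does not flip the prediction for the next loop entry, unlike a last time predictor.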

===== Lecture 26 =====
  Control Flow - III & Concurrency

  * Predicated Execution
    - Predication decisions at the compiler
    - Rename stage modifications
  * Limitations of predication
    - Adaptivity
    - Complex Control Flow Graphs
    - ISA support
  * Wish branches
    - Wish jump/join
    - Wish loop
  * Wish branches vs Predicated Execution
  * Wish branches vs Branch prediction
  * Diverge-Merge Processor
  * Dynamic-Hammock
  * Multi-path Execution
  * Research issues in control flow handling
    - Hardware/software cooperation
    - Fetch gating
    - Recycling useful work done on wrong path
  Concurrency
  * Classification of machines
    - SISD
    - SIMD
    - MIMD
  * Decoupled Access/Execute
  * Astronautics ZS-1
  * Loop unrolling

===== Lecture 27 =====
  VLIW

  * Each VLIW instruction - a bundle of independent instructions (identified by compiler)
  * Each instruction bundle executed by hardware in lockstep
  * Commercial VLIW machines
     - TI C6000, TriMedia, STMicro
  * Intel IA-64 - Partially VLIW
  * Encoding VLIW NOPs
  * Static Instruction Scheduling for VLIW
  * Code motion - Safety & Legality
  * Trace scheduling
  * List scheduling
  * Super block scheduling
  * Hyperblock scheduling
  * The Intel IA-64 architecture
     - No lock step execution of a bundle
     - Specify dependencies between instructions within a bundle
     - Template bits
  * What hinders static code motion?
     - Exceptions
     - Loads/Stores