Lecture Notes

Old Lecture Notes (for reference)

Lecture 29: Control Flow II

  • Slides
  • Buzzwords
    • Conditional branches
    • Fine-grained multithreading
    • Predicated execution
    • Convert control dependency into a data dependency
    • Branch elimination
    • Conditional move
    • Fetch break
    • Vector mask
    • “Straight-line” code
    • Compiler can move around code
    • Multipath execution
    • Wasted work
    • Control flow merge
    • Branch prediction
    • Branch target buffer (BTB)
    • Direction prediction
    • Static vs. dynamic
    • Always not-taken
    • Always taken
    • Backward taken, forward not-taken (BTFN)
    • Profile-based branch prediction
    • Program-based branch prediction
    • Dynamic branch prediction
    • Last time predictor
    • Two-bit counter based prediction
    • Hysteresis
    • Saturating arithmetic
    • Two-level branch predictors (see the sketch after this lecture's readings)
    • Branch history register (BHR)
    • Pattern history table (PHT)
    • Prediction and update
    • Variations of Two-Level Predictors
      • BHR: G, S, P
      • PHT counter: A, S
      • PHT: g, s, p
    • Branch correlation
    • Global two-level predictor
      • Branch interference in BHR
    • Pentium Pro branch predictor
    • Local two-level predictor
    • Hybrid branch predictor
    • Alpha 21264 tournament predictor
    • Prediction accuracy
  • Mentioned readings
    • McFarling, “Combining Branch Predictors,” DEC WRL TR, 1993.
    • Carmean and Sprangle, “Increasing Processor Performance by Implementing Deeper Pipelines,” ISCA 2002.
    • Evers et al., “An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work,” ISCA 1998.
    • Yeh and Patt, “Alternative Implementations of Two-Level Adaptive Branch Prediction,” ISCA 1992.
    • Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001.
    • Kim et al., “Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths,” MICRO 2006.
    • Allen et al., “Conversion of control dependence to data dependence,” POPL 1983.
    • Pettis and Hansen, “Profile guided code positioning,” PLDI 1990.
    • Smith, “A Study of Branch Prediction Strategies,” ISCA 1981.
    • Riseman and Foster, “The inhibition of potential parallelism by conditional jumps,” IEEE Transactions on Computers, 1972.
    • Ball and Larus, “Branch prediction for free,” PLDI 1993.
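
  The two-level idea above (a branch history register indexing a pattern history table of two-bit saturating counters) is easy to see in code. Below is a minimal C sketch of a gshare-style global predictor in the spirit of McFarling's report; the table size and XOR indexing are illustrative assumptions, not the lecture's exact configuration.

    #include <stdint.h>

    #define PHT_BITS 12                  /* illustrative: 4096 counters */
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];       /* 2-bit saturating counters (0..3) */
    static uint16_t bhr;                 /* global branch history register */

    /* Predict taken if the counter is in one of the two "taken" states. */
    int predict(uint32_t pc) {
        uint32_t idx = (pc ^ bhr) & (PHT_SIZE - 1);
        return pht[idx] >= 2;
    }

    /* Update: saturating increment/decrement provides hysteresis, then the
     * actual outcome is shifted into the history register. */
    void update(uint32_t pc, int taken) {
        uint32_t idx = (pc ^ bhr) & (PHT_SIZE - 1);
        if (taken && pht[idx] < 3) pht[idx]++;
        if (!taken && pht[idx] > 0) pht[idx]--;
        bhr = (uint16_t)(((bhr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1));
    }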

Lecture 28: Prefetching III and Control Flow I

  • Slides
  • Buzzwords
    • Hybrid H/W prefetcher
    • Prefetcher coverage
    • Execution-based prefetcher
    • Speculative thread (“pruned program”)
    • Spawn instruction
    • Where to execute the precomputation thread?
    • When to spawn the precomputation thread?
    • When to terminate the precomputation thread?
    • Problem instructions
    • Pre-execution slice
    • Runahead execution
    • Switch-on-event multithreading
    • Prefetching in multi-core
    • Shared data
    • Resource contention
    • Local vs. hierarchical prefetcher throttling
    • Control flow
    • Branch
    • Fetch address
    • Branch types (conditional, unconditional, call, return, indirect)
    • Branch prediction
    • Eliminating branches (see the sketch after this lecture's readings)
    • Predicate combining
    • Short-circuit evaluation
    • Branch delay slot
    • Delayed branches with squashing
    • Fine-grained multithreading
  • Mentioned readings
    • Dubois and Song, “Assisted Execution,” USC Tech Report 1998.
    • Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999.
    • Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001.
    • Luk, “Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors,” ISCA 2001.
    • Zilles and Sohi, “Understanding the backward slices of performance degrading instructions,” ISCA 2000.
    • Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
    • Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006.
    • Ebrahimi et al., “Coordinated Management of Multiple Prefetchers in Multi-Core Systems,” MICRO 2009.
    • McFarling, “Combining Branch Predictors,” DEC WRL TR, 1993.
    • Carmean and Sprangle, “Increasing Processor Performance by Implementing Deeper Pipelines,” ISCA 2002.
    • Evers et al., “An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work,” ISCA 1998.
    • Yeh and Patt, “Alternative Implementations of Two-Level Adaptive Branch Prediction,” ISCA 1992.
    • Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001.
    • Kim et al., “Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths,” MICRO 2006.
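
  The "eliminating branches" and "predicate combining" items above turn a control dependency into a data dependency. A hypothetical C snippet (not from the slides) showing the transformation a compiler performs when it emits a conditional move:

    /* Branching version: performance depends on predicting the branch. */
    int max_branch(int a, int b) {
        if (a > b) return a;
        return b;
    }

    /* Branch-free version: the condition becomes data. A compiler can map
     * this onto a conditional-move instruction, so there is nothing to
     * mispredict; the cost is that both operands are always evaluated. */
    int max_cmov(int a, int b) {
        int cond = (a > b);              /* 0 or 1 */
        return cond * a + (1 - cond) * b;
    }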

Lecture 27: Prefetching II

  • Slides
  • Buzzwords
    • Prefetcher training
    • Stride prefetching (see the sketch after this list)
    • Instruction-address-based prefetching
    • Cache-block-address-based prefetching
    • Stream buffer
    • Locality-based prefetching
    • Accuracy
    • Coverage
    • Timeliness
    • Bandwidth consumption
    • Pollution
    • Prefetch distance
    • Prefetch degree
    • Prefetcher throttling
    • Irregular access patterns
    • Correlation-based prefetchers
    • Content-directed prefetchers
    • Markov prefetching
    • Precomputation/execution-based prefetchers
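
  A minimal C sketch of the instruction-address-based stride prefetching listed above: a table indexed by load PC records the last address and stride, with a small confidence counter. Table size, confidence threshold, and prefetch distance are illustrative assumptions.

    #include <stdint.h>

    #define TBL_SIZE 256                 /* illustrative table size */

    struct entry {
        uint64_t last_addr;              /* last data address from this PC */
        int64_t  stride;                 /* currently trained stride */
        int      conf;                   /* saturating confidence (0..3) */
    };

    static struct entry tbl[TBL_SIZE];

    /* Train on every load; return a prefetch address, or 0 for none. */
    uint64_t train_and_prefetch(uint64_t pc, uint64_t addr) {
        struct entry *e = &tbl[(pc >> 2) % TBL_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride) {
            if (e->conf < 3) e->conf++;  /* same stride again: more confident */
        } else {
            e->stride = stride;          /* retrain on a new stride */
            e->conf = 0;
        }
        e->last_addr = addr;
        /* Prefetch 4 strides ahead once confidence is established. */
        return (e->conf >= 2) ? (uint64_t)((int64_t)addr + 4 * e->stride) : 0;
    }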

Lecture 25: Main Memory & Memory Scheduling

  • Main Memory Slides
  • Memory Scheduling (Thread Cluster Memory Scheduling)
  • Main Memory Buzzwords
    • SRAM (6T) vs. DRAM (1T-1C)
    • Bank
    • Address decoding
    • Sense amplifier
    • SRAM (Static Random Access Memory)
      • Fast access
      • Lower density
      • Simpler to fabricate
    • DRAM (Dynamic Random Access Memory)
      • Slower access
      • Higher density
      • Destructive read
      • Refresh
    • DRAM memory subsystem organization
      • Logical: Channel, rank, bank, row, column
      • Physical: DIMM (dual in-line memory module), device (chip)
    • DRAM row (page)
    • Sense-amplifiers or row-buffer
    • Open row, closed row
    • Activate, read/write, precharge
    • Multiple channels, banks
    • Parallel access
    • Address mapping (row-interleaving vs. block-interleaving)
    • Bank mapping randomization
    • Refresh (burst vs. distributed)
    • DRAM controller (on-chip vs. off-chip)
    • DRAM scheduling
      • FCFS
      • FR-FCFS: Maximize row-buffer hit rate (see the sketch after this lecture's buzzword lists)
  • Memory Scheduling Buzzwords
    • System throughput
    • Fairness
    • Memory-non-intensive vs. memory-intensive
    • Thread clusters (non-intensive vs. intensive)
    • Memory intensity (MPKI: last-level cache misses per kiloinstruction)
    • Memory bandwidth usage
    • Prioritize non-intensive cluster
    • Priority shuffling within intensive cluster
    • Random vs. streaming access pattern
    • Niceness
    • Niceness-aware asymmetric shuffling
    • Weighted speedup, maximum slowdown
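
  The FR-FCFS scheduler listed under DRAM scheduling is a two-level selection: first-ready (row-buffer hit) requests win, and age breaks ties. A simplified single-bank C model; the request structure is an illustrative assumption.

    #include <stddef.h>
    #include <stdint.h>

    struct request {
        uint64_t row;                    /* DRAM row this request targets */
        uint64_t arrival;                /* arrival time, for FCFS ordering */
    };

    /* Pick the next request for a bank whose row buffer holds open_row:
     * prefer row-buffer hits (no precharge/activate needed), then oldest. */
    struct request *frfcfs_pick(struct request *q, size_t n, uint64_t open_row) {
        struct request *best = NULL;
        for (size_t i = 0; i < n; i++) {
            int hit      = (q[i].row == open_row);
            int best_hit = best && (best->row == open_row);
            if (!best ||
                (hit && !best_hit) ||
                (hit == best_hit && q[i].arrival < best->arrival))
                best = &q[i];
        }
        return best;
    }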

Milestone 2 Presentations

Lecture 24: Prefetching

Lecture 23: Virtual Memory, Victim Tag Store

  • Slides (Virtual Memory)
  • Slides (Victim Tag Store)

Lecture 22: Caching In Multi-Core Architectures

  • Slides
  • Buzzwords
    • Insertion
    • Reuse
    • Shared vs. private
    • Placement
    • Application-awareness
    • Cache sharing
      • Free-for-all vs. controlled
    • Cache friendly vs. unfriendly
    • Utility-based cache partitioning
      • Utility monitors
      • Streaming
      • Cache-fitting
      • Stack property of utility information
      • Auxiliary tag store (ATS)
      • Dynamic set sampling
      • Partitioning algorithm (see the sketch after this lecture's readings)
      • Way partitioning
    • Performance metrics
      • Weighted speedup
      • Throughput
      • Hmean-fairness
    • Software-based cache management
    • Cache sharing aware thread scheduling
    • Page coloring
    • Static vs. dynamic cache partitioning
    • Page re-coloring
    • Software vs. hardware cache management
    • Shared data in private caches
    • Non-uniform cache access (NUCA)
      • Wire delay
      • Partition
    • Cache as memory bandwidth filter
    • Streaming vs. non-streaming
    • Dead-on-arrival
    • Bimodal insertion policy
    • Dynamic insertion policy (DIP)
  • Mentioned readings
    • Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
    • Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.
    • Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004.
    • Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.
    • Fedorova et al., “Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler,” PACT 2007.
    • Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008.
    • Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006.
    • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.
    • Kim et al., “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches,” ASPLOS 2002.
    • Lai et al., “Dead Block Prediction,” ISCA 2001.
    • Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007.
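
  The "partitioning algorithm" step of utility-based cache partitioning can be sketched as a marginal-utility loop over the counts gathered by the utility monitors. This is a simple greedy variant for illustration, not the exact lookahead algorithm of the Qureshi and Patt paper; the hits[][] input is assumed to come from the ATS-based monitors.

    #define NUM_APPS 2
    #define NUM_WAYS 16

    /* hits[a][w]: hits application a would get with w ways (w = 0..NUM_WAYS),
     * estimated via dynamic set sampling; treated as a given input here. */
    void partition(const int hits[NUM_APPS][NUM_WAYS + 1], int alloc[NUM_APPS]) {
        for (int a = 0; a < NUM_APPS; a++)
            alloc[a] = 0;
        for (int w = 0; w < NUM_WAYS; w++) {
            /* Hand the next way to the app with the largest marginal gain. */
            int best = 0, best_gain = -1;
            for (int a = 0; a < NUM_APPS; a++) {
                int gain = hits[a][alloc[a] + 1] - hits[a][alloc[a]];
                if (gain > best_gain) { best_gain = gain; best = a; }
            }
            alloc[best]++;
        }
    }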

Lecture 21: Memory Channel Partitioning

Lecture 20: Caching III

Lecture 19: Caching II

Lecture 18: Caching I

  • Slides
  • Buzzwords
    • Memory latency
    • Multi-level register files
    • Cache
    • Memory hierarchy
    • Fundamental memory tradeoffs
    • Cache hierarchy
    • Temporal locality
    • Spatial locality
    • Cache hit/miss
    • Cache placement/replacement/write policy
    • Instruction cache vs. data cache
    • Cache block (line)
    • Tag store / data store
    • Cache hit rate
    • Average memory access time
    • Data placement
    • Direct-mapped cache (see the lookup sketch at the end of this lecture)
    • Conflict misses
    • Index/tag bits
    • Valid bit
    • Cache set
    • Set associativity
    • Way
    • Fully associative cache
    • LRU replacement policy
    • Random replacement policy
    • Set thrashing
    • Belady's OPT
    • Write through
    • Write back
    • Cache consistency
    • Allocate on write miss
    • No-allocate on write miss
  • Mentioned readings
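
  A minimal C sketch of the direct-mapped cache lookup described by the buzzwords above, showing the offset/index/tag split, the valid bit, and where conflict misses come from; the geometry is an illustrative assumption.

    #include <stdint.h>

    #define BLOCK_BITS 6                 /* 64 B blocks */
    #define INDEX_BITS 10                /* 1024 sets (one way each) */
    #define NUM_SETS   (1u << INDEX_BITS)

    struct line { uint64_t tag; int valid; };
    static struct line cache[NUM_SETS];

    /* Returns 1 on hit. A miss (re)fills the single candidate line, evicting
     * whatever was there: two blocks mapping to the same index conflict.
     * Average memory access time then follows as
     * AMAT = hit time + miss rate * miss penalty. */
    int access_cache(uint64_t addr) {
        uint64_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
        uint64_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
        if (cache[index].valid && cache[index].tag == tag)
            return 1;                    /* hit */
        cache[index].valid = 1;          /* miss: fill */
        cache[index].tag   = tag;
        return 0;
    }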

Lecture 17: Asymmetric Multicore

Lecture 16: Runahead Execution II

Lecture 15: Runahead Execution

Lecture 14: Out-of-Order Execution III

Lecture 13: Q/A Session

  • Projects
  • Last year's midterm

Lecture 12: Out-of-Order Execution III

Lecture 11: Out-of-Order Execution II

  • Buzzwords
    • Static vs. dynamic instruction scheduling
    • Instruction scheduling
      • Branch direction
      • Latency of a load
      • Address of memory operations
    • Reservation station
    • Central physical register file
    • Tomasulo's algorithm
    • Memory disambiguation
    • Advanced load (Itanium)
    • Dataflow graph
    • Renaming eliminates false dependencies
    • Tag broadcast (sketched below)
    • Dataflow processor
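
  The tag broadcast step of Tomasulo-style scheduling can be sketched in C: when an instruction completes, its destination tag is compared against every waiting source tag (a CAM match in hardware) to wake up consumers. Entry format and sizes are illustrative assumptions.

    #include <stdint.h>

    #define RS_SIZE  16
    #define NUM_SRCS 2

    struct rs_entry {
        int      busy;
        int      src_ready[NUM_SRCS];
        uint16_t src_tag[NUM_SRCS];      /* tag of the producing instruction */
    };

    static struct rs_entry rs[RS_SIZE];

    /* Broadcast a completing instruction's tag to all reservation stations;
     * any matching, not-yet-ready source becomes ready. An entry whose
     * sources are all ready can then be selected for execution. */
    void tag_broadcast(uint16_t tag) {
        for (int i = 0; i < RS_SIZE; i++) {
            if (!rs[i].busy) continue;
            for (int s = 0; s < NUM_SRCS; s++)
                if (!rs[i].src_ready[s] && rs[i].src_tag[s] == tag)
                    rs[i].src_ready[s] = 1;
        }
    }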

Lecture 10: Out-of-Order Execution I

Lecture 9: Precise Exceptions II

  • Slides
  • Buzzwords
    • Precise exceptions
    • Reorder buffer
    • Content addressable memory (CAM)
    • Register renaming (see the sketch at the end of this lecture)
    • False dependencies
      • Output dependency
      • Anti dependency
    • Bypassing data from reorder buffer
    • History buffer
    • Future file
    • Checkpointing
    • Exception handling
    • Architectural state
    • Branch misprediction recovery
    • Store/write buffer
    • Dispatch
  • Mentioned readings
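
  Register renaming's removal of the false (output and anti) dependencies listed above can be sketched with a register alias table plus a free list of physical registers. A minimal C model; sizes are illustrative, and the free list is assumed to be pre-populated.

    #define ARCH_REGS 32
    #define PHYS_REGS 128

    static int rat[ARCH_REGS];           /* arch reg -> current phys reg */
    static int free_list[PHYS_REGS];     /* stack of free phys regs */
    static int free_top;                 /* assumed initialized elsewhere */

    /* Rename one instruction "dst <- src1 op src2". Sources read the current
     * mapping; the destination gets a fresh physical register, so a later
     * write to the same arch reg (output dependency) or an earlier pending
     * read (anti dependency) can no longer conflict; only true (flow)
     * dependencies remain. */
    void rename(int dst, int src1, int src2,
                int *pdst, int *psrc1, int *psrc2) {
        *psrc1 = rat[src1];
        *psrc2 = rat[src2];
        *pdst  = free_list[--free_top];
        rat[dst] = *pdst;
    }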

Lecture 8: Precise Exceptions I

Lecture 7: Pipelining

Lecture 6: Performance

Lecture 5: Project Example and Hybrid Main Memory

Lecture 4: ISA Tradeoffs

Lecture 3: SIMD, MIMD, and ISA Principles

  • Slides
  • Mentioned readings
  • Buzzwords
    • Cache-line ping-ponging between cores
    • Synchronization
    • Levels of transformation
    • Instruction set architecture (ISA)
    • Add instruction (ISA) vs. adder implementations: bit-serial, carry lookahead, ripple carry (microarchitecture); see the adder sketch at the end of this lecture
    • Architecture = ISA (contract between SW/HW) + microarchitecture + circuits
    • uarch changes faster than ISA (backwards compatibility)
    • uarch: implementation of ISA under specific design constraints & goals
    • Design point, problem space
    • Semantic gap: Where to place the ISA?
    • High-level language (HLL) vs. CISC vs. RISC
    • Translation
      • Transmeta: software translation from x86 to underlying VLIW ISA
      • Intel: hardware translation from x86 to microops
    • VAX INDEX instruction
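
  The add-instruction example above separates contract from implementation: the ISA promises only the sum, while bit-serial, ripple-carry, and carry-lookahead adders are interchangeable microarchitectures with different cost and delay. A C model of the ripple-carry version:

    #include <stdint.h>

    /* Ripple-carry addition: each bit position is a full adder whose carry-in
     * is the previous position's carry-out, so the carry "ripples" through
     * all 32 bits. A carry-lookahead adder computes the same result with a
     * shorter critical path, at higher hardware cost. */
    uint32_t ripple_add(uint32_t a, uint32_t b) {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < 32; i++) {
            uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            sum   |= (ai ^ bi ^ carry) << i;
            carry  = (ai & bi) | (ai & carry) | (bi & carry);
        }
        return sum;
    }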

Lecture 2: SIMD, MIMD, NoC Principles and Tradeoffs

Lecture 1: Processing Paradigms and Intro

  • Slides
  • Mentioned readings
  • Buzzwords
    • Pipeline depth vs. branch misprediction penalty (pipeline flush)
    • Eager execution
    • Heilmeier's Catechism (articulating your research)
    • Simulation as a way of dealing with the exponential explosion of the trade-off design space.
    • Analytical modeling is another option, but it struggles to capture the inner complexities of the processor.
    • Data parallelism vs. control parallelism
    • Scalar vs. SIMD vs. VLIW vs. Data Flow
    • Two examples of SIMD: array processors and vector processors.

Lecture 0: Intro and Basics

