===== Buzz Words ===== These are a list of keywords related to the topics covered in the class. These are to jog your memory. For a comprehensive list, please refer to lecture notes and the notes you take during class. ===== Lecture 2 ===== ISA, Trade-offs, Performance * ISA vs Microarchitecture * Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders) * Speculation (branch prediction overview) * Superscalar processing (multiple instruction issue, VLIW) * Prefetching (briefly covered) * Power/clock gating (briefly covered) ===== Lecture 3 ===== Performance * Addressing modes (Registers, Memory, Base-index-displacement) * 0/1/2/3 Address machines (accumulator machines, stack machines) * Aligned accesses (Hardware handled vs Software handled: trade-offs) * Transactional memory (overview) * Von-Neumann model (Sequential instruction execution) * Data flow machines (Data driven execution) * Evaluating performance ===== Lecture 4 ===== Pipelining * Performance metrics (Instructions per second, FLOPS, Spec/Mhz) * Amdahl's law * Pipelining vs Multi-cycle machines * Stalling (dependencies) * Data dependencies (true, output, anti) * Value prediction * Control dependencies * Branch prediction / Predication * Fine-grained multi-threading ===== Lecture 5 ===== Precise Exceptions * Need for separate instruction and data caches * Exceptions vs Interrupts * Reorder buffer/History buffer/Future files * Impact of size of architectural register file * Checkpointing * Reservation stations * Precise exceptions ===== Lecture 6 ===== Virtual Memory * Problems in early machines (size, protection, relocation, contiguity, sharing) * Segmentation (solves some of these problems) * Virtual memory (high level) * Implementing translations (page tables, inverted page tables) * TLBs (caching translations) * Virtually-tagged caches and Physically-tagged caches * Address space identifiers * Sharing (overview) * Super pages (overview) ===== Lecture 7 ===== Out-of-Order Execution * Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring) * Super scalar vs Out-of-Order execution * Precise exceptions (summary) * Branch prediction vs exceptions * Out-of-Order execution (preventing dispatch stalls) * Tomasulo's algorithm * Dynamic data-flow graph generation ===== Lecture 8 ===== Exploiting ILP * Overview of Tomasulo's algorithm * Problems with increasing instruction window to entire program - number of comparators - ROB size - instruction window * Implementation issues in out-of-order processing (decoupling structures) - physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file) - Register alias tables and architectural register alias table * Handling branch miss prediction (don't have to wait for branch to become oldest) - checkpoint the register alias table on each branch - what if we have a lot of branches? (confidence estimation) * Handling stores - store buffers - load searches store buffer/cache - what if addresses are not yet computed for some stores? (unknown address problem) - how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer) * Problems with physical register files - register file read always on the critical path * Multiple instruction issue * Distributed/Centralized reservation stations * Scheduling - Design Issues * Dependents on a load miss - FIFO for load dependants - Register deallocation * Latency tolerance of out of order processors ===== Lecture 9 ===== Caching Basics * Caches * Direct mapped caches * Set associative caches * Miss rate * Average Memory Access Time * Cache placement * Cache replacement * Handling writes in caches * Inclusion/Exclusion ===== Lecture 10 ===== Runahead and MLP * Lack of temporal and spatial locality (long strides, large working sets) * Stride prefetching (software vs hardware: complexity, dynamic information) * Irregular accesses (hash tables, linked structures?) * Where OoO cannot really benifit? (L2 cache misses) (need large instruction windows) * Run-ahead execution (Generating cache misses that can be serviced in parallel) * Problems with run-ahead - length of run-ahead - what if new cache-miss is dependent on original miss? - Branch mispredictions and miss dependent branches * DRAM bank organization * Tolerating memory latencies - Caching - Prefetching - Multi-threading - Out of order execution * Fine-grained multi-theading (design and costs) * Causes of inefficiency in Run-ahead (energy consumption) * Breaking dependence - address prediction (AVD prediction) ===== Lecture 11 ===== OOO wrap-up and Advanced Caching * Dual Core Execution (DCE) * Comparison between run ahead and DCE - Lag between the front and the back cores - controlled by result queue sizing * Slipstreaming * SMT Architectures for slipstreaming instead of 2 separate cores * Result queue length in DCE * Store-Load dependencies * Store buffer design - Content associative, age ordered list of stores * Memory Disambiguation - Load dependence/independence on previous stores - Store/Load dependence prediction * Speculative execution and data coherence - Load buffer * Research issues in OoO - Scalable and energy-efficient instruction windows - Packing more MLP into a small window * OOO in Multi core systems - Memory system contention - bigger issue - Multiple cores to perform OOO - Asymmetric Multi-cores * Symmetric vs Asymmetric multi cores - Accelerating critical sections - Core fusion * Inclusion/Exclusion * Multi-level caching ===== Lecture 12 ===== Advanced Caching * Handling writes - Write-back - Write-through - Write allocate/no allocate * Instruction/Data Caching * Cache Replacement Policies - Random - FIFO - Least Recently Used - Not Most Recently Used - Least Frequently used * LRU Vs Random - Random as good as LRU for most practical workloads * Optimal Replacement Policy * MLP aware cache replacement * Cache Performance - Reducing miss rate - Reducing miss latency/cost * Cache Parameters - Cache size vs hit rate - Block size - Large Blocks - Critical words - Large blocks - bandwidth wastage - Sub blocking - Associativity - Power of 2 associativity? - Hybrid Replacement policies - Sampling based hybrid (random/LRU) replacement * Cache Misses - Compulsory - Conflict - Capacity - Coherence * Cache aware schedulers - cache affinity based application mapping * Victim caches * Hashing - Randomizing index functions * Pseudo associativity - serial cache lookup ===== Lecture 13 ==== More Caching * Speculative partial tag comparison * Skewed associative caches - Randomizing the index for different ways * Improving hit rate in software - Loop interchange - Row major, Column major - Blocking - Loop fusion, Array merging - Data structure layout - Packing frequently used fields in arrays * Handling multiple outstanding misses - Non blocking caches - Miss Status Handling Registers (MSHR) - Accessing MSHRs * Reducing miss latency through software - Compiler level reordering of loops - Software prefetching * Handling multiple accesses in a cycle - True/Virtual multiporting - Banking/Interleaving ===== Lecture 14 ==== Prefetching * Compulsory/conflict/capacity misses and prefetching * Coherence misses and prefetching * False sharing - Word/byte based coherence - Value prediction/Speculative execution * Prefetching and correctness * What/When/Where/How - Accuracy - Timeliness - Coverage - Prefetch buffers - Skewing prefetches towards demand fetches - Software/Hardware/Execution-based prefetchers * Software prefetching - Binding/Non-binding prefetches - Prefetching during pointer chasing - x86 prefetch instructions - prefetching into different levels of cache - Handling of prefetches that cause TLB misses/page faults - Compiler driven prefetching - Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load * Hardware prefetching - Next line prefetchers - Stride prefetchers - Instruction based stride prefetchers - Stream buffers - Locality based prefetching * Prefetcher performance - Accuracy - Coverage - Timeliness - Aggressiveness - Prefetcher distance - Prefetcher degree * Irregular prefetching - Markov prefetchers ===== Lecture 15 ==== Prefetching (wrap up) * Power 4 System Microarchitecture (prefetchers) (IBM Journal of R & D) * Irregular patterns (indirect array accesses, linked structures) - markov prefetching - linked lists or trees - markov prefetchers vs stride prefetches * Content directed prefetching - pointer based structures - identifying pointers (software mechanism/hardware prediction) - compiler analysis to provide hints for useful prefetches * Hybrid prefetchers * Execution based prefetchers - pre-execution thread for creating prefetches for the main program - determining when to start the pre-execution - similar idea for branch prediction ===== Lecture 16 ==== * Prefetching in multicores * Importance of prefetch efficiency * Issues with local prefefetcher throttling * Hierarchical prefetcher throttling * Cache coherence - Snoopy cache coherence * Shared caches in multicores * Utility based cache partitioning ===== Lecture 17 ==== * Software based cache management - Thread scheduling - Page coloring - Dynamic partitioning through page recoloring * Cache placement - Insertion - Re-insertion - Circular reference model - Dynamic insertion policy - LRU and Bimodal insertion ===== Lecture 19 ==== Main memory system * Memory hierarchy * SRAM/DRAM cell structures * Memory bank organization * Page mode DRAM - Bank operation * Basic DRAM operation - Controller latency - Bank latency * DRAM chips/DIMMs * DRAM channels * Address mapping/interleaving * Bank mapping randomization * DRAM refresh * DRAM controller issues - Memory controller placement ===== Lecture 20 ===== * DRAM controller functions - Refresh - Scheduling * DRAM Scheduling policies - FCFS - FR-FCFS * Row buffer management policies - Open row - Closed row * DRAM controller design - Machine learning - a possibility * DRAM power states * Inter-thread interference interference in DRAM * Multi-core DRAM controllers - Stall-time fairness - Unfairness - Estimating alone runtime - Providing system software support - Parallelism aware batch scheduling - Request batching - Within batch scheduling ===== Lecture 21 ==== Super scalar processing I * Types of parallelism - Task - Thread - Instruction * Fetch stage - Instruction alignment Issue - Solution - Split cache line fetch - Fetch break - branches in the fetch block * Fetch break solutions - Short distance predicted-taken branch - Basic block reordering - Super block code optimization - Trace cache ===== Lecture 22 ==== Super scalar processing II * Trace Caches - Multiple branch prediction aliasing issue - Inactive issue - Promoting highly biased branches to static branches - Saturating counters - Fill unit optimizations - Making highly biased branch paths atomic - Redundancy - Solution : Block based trace cache - An Enhanced Instruction cache vs a trace cache - Pentium 4 trace cache * Block structured ISA - Enlarged block branches - faults - Super block vs Block structured ISAs * Decode in superscalar processing - Predecoding - Decode cache - CISC to RISC translation in hardware - Micro code sequencing - Pentium 4 decoders - Prentium pro decoders - simple decoders - Complex decoder - Micro op sequencer - Instruction buffering fetch and decode ===== Lecture 23 ==== Superscalar Processing III * Renaming multiple instructions - dependency check logic (n^2 comparators) - help from compiler * ensure instructions are independent (difficult for wide fetches) * hardware-software co-design to simplify dependency logic * Dispatching multiple instructions - wakeup logic (compare all tags in reservation station with all the tags that are broadcast) - select logic (hierarchical tree based selection) * Execute - enough execution units - enough forwarding paths (broadcast tag/value to all functional units) * Reducing dispatch+bypass delays - clustering (divide window into multiple clusters) - intra-cluster bypass is fast - inter-cluster bypass can be slow * Register file - need multiple reads/writes per cycle - Replicate or partition the register files - using block-structured ISA * Retirement - updating architectural register map ===== Lecture 24 ==== Control Flow * Problem of branches * Types * conditional, unconditional, call, return, indirect branches * Handling conditional branches * Predicate combining * condition codes vs condition registers * Delayed branching * Fine-grained multi-threading * Branch prediction * predicting if an instruction is a branch (predecoding) * predicting the direction of the branch * predicting the target address of a branch * Static branch predition * always taken/not taken * backward taken, forward not taken * by compiler based on profiling * Dynamic branch prediction * last time predictor * history based predictors * two-level predictors ===== Lecture 25 ==== Control Flow - II * 2-bit counter based prediction * Global branch prediction * Global branch correlation * Global two-level prediction - Global history register * Local two-level prediction - Pattern history table - Interference in the pattern history table - Randomizing the index into the pattern history table - Agree prediction * Alpha 21264 Tournament Predictor * Perceptron branch predictor - Perceptron - learns a target boolean function of N inputs * Call and Return Prediction * Indirect branch prediction - Virtual Conditional Branch prediction * Branch prediction issues - Need to know a branch as soon as it is fetched - Latency - State recovery upon misprediction * Predicated execution ==== Lecture 26 ==== Control Flow - III & Concurrency * Predicated Execution - Predication decisions at the compiler - Rename stage modifications * Limitations of predication - Adaptivity - Complex Control Flow Graphs - ISA support * Wish branches - Wish jump/join - Wish loop * Wish branches vs Predicated Execution * Wish branches vs Branch prediction * Diverge-Merge Processor * Dynamic-Hammock * Multi-path Execution * Research issues in control flow handling - Hardware/software cooperation - Fetch gating - Recycling useful work done on wrong path Concurrency * Classification of machines - SISD - SIMD - MIMD * Decoupled Access/Execute * Astronautics ZS-1 * Loop unrolling ==== Lecture 27 ==== VLIW * Each VLIW instruction - a bundle of independent instructions (identified by compiler) * Each instruction bundle executed by hardware in lockstep * Commercial VLIW machines - TIC6000, Trimedia, STMicro * Intel IA-64 - Partially VLIW * Encoding VLIW NOPs * Static Instruction Scheduling for VLIW * Code motion - Safety & Legality * Trace scheduling * List scheduling * Super block scheduling * Hyperblock scheduling * The Intel IA-64 architecture - No lock step execution of a bundle - Specify dependencies between instructions within a bundle - Template bits * What hinder static mode motion? - Exceptions - Loads/Stores