====== Buzzwords ======
Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.
===== Lecture 1 (1/12 Mon.) =====
* Level of transformation * Algorithm * System software * Compiler * Cross abstraction layers * Tradeoffs * Caches * DRAM/memory controller * DRAM banks * Row buffer hit/miss * Row buffer locality * Unfairness * Memory performance hog * Shared DRAM memory system * Streaming access vs. random access * Memory scheduling policies * Scheduling priority * Retention time of DRAM * Process variation * Retention time profile * Power consumption * Bloom filter (see the sketch after this list) * Hamming code * Hamming distance * DRAM row hammer
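As a companion to the Bloom filter buzzword above, here is a minimal sketch of the idea: a bit array plus a few hash functions gives approximate set membership with possible false positives but no false negatives. The array size and the two hash functions below are illustrative assumptions, not anything given in lecture.
<code c>
#include <stdint.h>

#define BLOOM_BITS 1024  /* illustrative bit-array size */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Two simple illustrative hash functions; real designs pick better ones. */
static unsigned h1(uint64_t key) { return (key * 2654435761u) % BLOOM_BITS; }
static unsigned h2(uint64_t key) { return ((key ^ (key >> 13)) * 40503u) % BLOOM_BITS; }

static void bloom_insert(bloom_t *b, uint64_t key) {
    unsigned a = h1(key), c = h2(key);
    b->bits[a / 8] |= 1u << (a % 8);
    b->bits[c / 8] |= 1u << (c % 8);
}

/* Returns 1 = "possibly in the set" (may be a false positive),
   0 = "definitely not in the set" (no false negatives). */
static int bloom_query(const bloom_t *b, uint64_t key) {
    unsigned a = h1(key), c = h2(key);
    return ((b->bits[a / 8] >> (a % 8)) & 1) &&
           ((b->bits[c / 8] >> (c % 8)) & 1);
}
</code>
Using more hash functions and a larger bit array lowers the false-positive rate, the same tradeoff that comes back with RAIDR's retention-time bins in Lecture 23.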
===== Lecture 2 (1/14 Wed.) =====
* Moore's Law * Algorithm --> step-by-step procedure to solve a problem * in-order execution * out-of-order execution * technologies that are available on cellphones * new applications that are made available through new computer architecture techniques * more data mining (genomics/medical areas) * lower power (cellphones) * smaller cores (cellphones/computers) * etc. * Performance bottlenecks in single-threaded/single-core processors * multi-core as an alternative * Memory wall (a part of the scaling issue) * Scaling issue * Transistors are getting smaller * Key components of a computer * Design points * Design processors to meet the design points * Software stack * Design decisions * Datacenters * Reliability problems that cause errors * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) * Pre-paradigm science * Normal science * Revolutionary science * Components of a computer * Computation * Communication * Storage * DRAM * NVRAM (Non-volatile memory): PCM, STT-MRAM * Storage (Flash/Harddrive) * Von Neumann Model (Control flow model) * Stored program computer * Properties of Von Neumann Model: Stored program, sequential instruction processing * Unified memory * When is an instruction interpreted as an instruction (as opposed to a datum)? * Program counter * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS * Data flow model * Data flow machine * Data flow graph * Operands * Live-outs/Live-ins * Different types of data flow nodes (conditional/relational/barrier) * How to do transactions in dataflow? * Example: bank transactions * Tradeoffs between control-driven and data-driven * Which is easier to program? * Which is easier to compile? * Which is more parallel (does that mean it is faster?) * Which machines are more complex to design? * In control flow, when a program is stopped, there is a pointer to the current state (precise state). * ISA vs. Microarchitecture * Semantics in the ISA * uArch should obey the ISA * Changing the ISA is costly, can affect compatibility. * Instruction pointers * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level * Conceptual techniques * Pipelining * Multiple instructions at a time * Out-of-order execution * etc. * Design techniques * Adder implementation (Bit serial, ripple carry, carry lookahead) * Connection Machine (an example of a machine that uses bit-serial adders to trade off latency for more parallelism) * Microprocessor: ISA + uArch + circuits * What is part of the ISA? Instructions, memory, etc. * Things that are visible to the programmer/software * What is not part of the ISA? (what goes inside: uArch techniques) * Things that are not supposed to be visible to the programmer/software but typically make the processor faster and/or consume less power and/or make it less complex
===== Lecture 3 (1/16 Fri.) =====
* Microarchitecture * Three major tradeoffs of computer architecture * Macro-architecture * LC-3b ISA * Unused instructions * Bit steering * Instruction processing style * 0,1,2,3 address machines * Stack machine * Accumulator machine * 2-operand machine * 3-operand machine * Tradeoffs between 0,1,2,3 address machines * Postfix notation * Instructions/Opcode/Operand specifiers (i.e. addressing modes) * Simple vs. complex data types (and their tradeoffs) * Semantic gap and level * Translation layer * Addressability * Byte/bit addressable machines * Virtual memory * Big/little endian * Benefits of having registers (data locality) * Programmer visible (Architectural) state * Programmers can access this directly * What are the benefits? * Microarchitectural state * Programmers cannot access this directly * Evolution of registers (from accumulators to registers) * Different types of instructions * Control instructions * Data instructions * Operation instructions * Addressing modes * Tradeoffs (complexity, flexibility, etc.) * Orthogonal ISA * Addressing modes that are orthogonal to instruction types * I/O devices * Vectored vs. non-vectored interrupts * Complex vs. simple instructions * Tradeoffs * RISC vs. CISC * Tradeoffs * Backward compatibility * Performance * Optimization opportunity * Translation
===== Lecture 4 (1/21 Wed.) =====
* Fixed vs. variable length instructions * Huffman encoding * Uniform vs. non-uniform decode * Registers * Tradeoffs in the number of registers * Alignment * How does MIPS load words across the alignment boundary?
===== Lecture 5 (1/26 Mon.) =====
* Tradeoffs in ISA: Instruction length * Uniform vs. non-uniform * Design point/Use cases * What dictates the design point? * Architectural states * uArch * How to implement the ISA in the uArch * Different stages in the uArch * Clock cycles * Multi-cycle machine * Datapath and control logic * Control signals * Execution time of instructions/programs (see the sketch at the end of this lecture's list) * Metrics and what they mean * Instruction processing * Fetch * Decode * Execute * Memory fetch * Writeback * Encoding and semantics * Different types of instructions (I-type, R-type, etc.) * Control flow instructions * Non-control flow instructions * Delayed slot/Delayed branch * Single cycle control logic * Lockstep * Critical path analysis * Critical path of a single cycle processor * What is in the control signals? * Combinational logic & Sequential logic * Control store * Tradeoffs of a single cycle uarch * Design principles * Common case design * Critical path design * Balanced designs * Dynamic power/Static power * Increases in power due to frequency
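The execution-time bullets above reduce to the basic performance equation: execution time = instruction count x CPI x clock period. A toy calculation with purely made-up numbers (not from the lecture) showing why a single-cycle machine's clock is limited by its critical path:
<code c>
#include <stdio.h>

int main(void) {
    /* All numbers below are illustrative assumptions, not lecture data. */
    double insts      = 1e9;   /* dynamic instruction count */
    double cpi_single = 1.0;   /* single-cycle machine: CPI = 1, but ...            */
    double clk_single = 5e-9;  /* ... the cycle must fit the longest instruction
                                   (critical path), so the clock period is long     */
    double cpi_multi  = 4.2;   /* multi-cycle machine: more cycles per instruction, */
    double clk_multi  = 1e-9;  /* ... but each cycle only covers one stage          */

    printf("single-cycle: %.3f s\n", insts * cpi_single * clk_single);
    printf("multi-cycle : %.3f s\n", insts * cpi_multi  * clk_multi);
    return 0;
}
</code>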
===== Lecture 6 (1/28 Wed.) =====
* Design principles * Common case design * Critical path design * Balanced designs * Multi-cycle design * Microcoded/Microprogrammed machines * States * Translation from one state to another * Microinstructions * Microsequencing * Control store - Produces control signals * Microsequencer * Control signals * What do they have to control? * Instruction processing cycle * Latch signals * State machine * State variables * Condition codes * Steering bits * Branch enable logic * Difference between gating and loading? (write enable vs. driving the bus) * Memory mapped I/O * Hardwired logic * What control signals come from hardwired logic? * Variable latency memory * Handling interrupts * Difference between interrupts and exceptions * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA) * Updating machine behavior * Horizontal microcode * Vertical microcode * Primitives
===== Lecture 7 (1/30 Fri.) =====
* Emulator (i.e. uCode allows a minimal datapath to emulate the ISA) * Updating machine behavior * Horizontal microcode * Vertical microcode * Primitives * nanocode and millicode * what are the differences between nano/milli/microcode * microprogrammed vs. hardwired control * Pipelining * Limitations of the multi-cycle design * Idle resources * Throughput of a pipelined design * What dictates the throughput of a pipelined design? * Latency of the pipelined design * Dependency * Overhead of pipelining * Latch cost? * Data forwarding/bypassing * What is the ideal pipeline? * External fragmentation * Issues in pipeline designs * Stalling * Dependency (Hazard) * Flow dependence * Output dependence * Anti dependence * How to handle them? * Resource contention * Keeping the pipeline full * Handling exceptions/interrupts * Pipeline flush * Speculation
===== Lecture 8 (2/2 Mon.) =====
* Interlocking * Multipath execution * Fine grain multithreading * No-op (Bubbles in the pipeline) * Valid bits in the instructions * Branch prediction * Different types of data dependence * Pipeline stalls * bubbles * How to handle stalls * Stall conditions * Stall signals * Dependences * Distance between dependences * Data forwarding/bypassing (see the sketch at the end of this lecture's list) * Maintaining the correct dataflow * Different ways to design data forwarding paths/logic * Different techniques to handle interlocking * SW based * HW based * Profiling * Static profiling * Help from the software (compiler) * Superblock optimization * Analyzing basic blocks * How to deal with branches? * Branch prediction * Delayed branching (branch delay slot) * Forward control flow/backward control flow * Branch prediction accuracy * Profile guided code positioning * Position the code based on the profile info. * Try to make the next sequential instruction be the next inst. to be executed * Predicate combining (combine predicates for a branch instruction) * Predicated execution (control dependence becomes data dependence)
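To make the forwarding/stall bullets above concrete, here is a small sketch of the checks a pipeline might perform between stages. The five-stage structure, field names, and the load-use stall rule are generic textbook assumptions, not the exact datapath used in lecture.
<code c>
#include <stdbool.h>
#include <stdio.h>

/* A generic in-flight instruction snapshot in a pipeline register (assumed layout). */
typedef struct {
    int  dest;       /* destination register number */
    bool reg_write;  /* does this instruction write a register? */
    bool is_load;    /* loads produce their data only at the end of MEM */
} inflight_t;

/* Flow (RAW) dependence: with full forwarding, only a load followed immediately
   by a dependent instruction still forces a one-cycle stall (bubble). */
static bool must_stall(int id_src1, int id_src2, inflight_t ex) {
    return ex.is_load && ex.reg_write &&
           (ex.dest == id_src1 || ex.dest == id_src2);
}

/* Forward from EX/MEM or MEM/WB to the ALU inputs when register numbers match. */
static const char *forward_src(int src, inflight_t ex_mem, inflight_t mem_wb) {
    if (ex_mem.reg_write && ex_mem.dest == src) return "EX/MEM";
    if (mem_wb.reg_write && mem_wb.dest == src) return "MEM/WB";
    return "register file";
}

int main(void) {
    inflight_t ex_mem = { .dest = 5, .reg_write = true, .is_load = true  };
    inflight_t mem_wb = { .dest = 7, .reg_write = true, .is_load = false };
    printf("stall? %d\n", must_stall(5, 2, ex_mem));            /* load-use hazard */
    printf("source r7 comes from %s\n", forward_src(7, ex_mem, mem_wb));
    return 0;
}
</code>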
===== Lecture 9 (2/4 Wed.) =====
* Definition of basic blocks * Control flow graph * Delayed branching * benefit? * What does it eliminate? * downside? * Delayed branching in SPARC (with squashing) * Backward compatibility with the delayed slot * What should be filled in the delayed slot * How to ensure correctness * Fine-grained multithreading * fetch from different threads * What are the issues (what if the program doesn't have many threads) * CDC 6600 * Denelcor HEP * No dependency checking * Inst. from different threads can fill in the bubbles * Cost? * Simultaneous multithreading * Branch prediction * Guess what to fetch next. * Misprediction penalty * Need to guess the direction and target * How to perform the performance analysis? * Given the branch prediction accuracy and penalty cost, how to compute the cost of a branch misprediction. * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute the cost coming from branch mispredictions. * How many extra instructions are being fetched? * What is the performance degradation? * How to reduce the miss penalty? * Predicting the next address (non PC+4 address) * Branch target buffer (BTB) * Predicting the address of the branch * Global branch history - for directions * Can use the compiler to profile and get more info * Input set dictates the accuracy * Adds time to compilation * Heuristics that are common and don't require profiling * Might be inaccurate * Do not require profiling * Static branch prediction * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch * For example, x86 has the hint bit * Dynamic branch prediction * Last time predictor * Two-bit counter based prediction * One more bit for hysteresis
===== Lecture 10 (2/6 Fri.) =====
* Branch prediction accuracy * Why is it so important? * Differences between 99% accuracy and 98% accuracy * Cost of a misprediction when the pipeline is very deep * Global branch correlation * Some branches are correlated with other branches * Local branch correlation * A branch can depend on its own past outcomes * Pattern history table * Record global taken/not taken results. * Cost vs. accuracy (What to record, do you record the PC? Just taken/not taken info?) * One-level branch predictor * What information is used * Two-level branch prediction * What entries do you keep in the global history? * What entries do you keep in the local history? * How many tables? * Cost when training a table * What are the purposes of each table? * Potential problems of a two-level history * GShare predictor (see the sketch at the end of this lecture's list) * Global history register is hashed with the PC * Store both GHR and PC information combined into one index * How do you use the information? Why is the XOR result still usable? * Warmup cost of the branch predictor * Hybrid solution? Fast warmup is used first, then switch to the slower one. * Tournament predictor (Alpha 21264) * Predicated execution - eliminates branches * What are the tradeoffs * What if the block is big (can lead to executing a lot of useless work) * Allows easier code optimization * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block * Reduces control dependences * Needs ISA support * Wish branches * Compiler generates both predicated and non-predicated code * HW decides which one to use * Use branch prediction on easy-to-predict code * Use predicated execution on hard-to-predict code * Compiler can be more aggressive in optimizing the code * What are the tradeoffs (slide# 47) * Multi-path execution * Execute both paths * Can lead to wasted work * VLIW * Superscalar
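A minimal gshare-style sketch for the two-level/GShare bullets above: the global history register is XORed with the PC to index a table of two-bit saturating counters (the extra bit provides hysteresis). The table size and hashing details are illustrative assumptions.
<code c>
#include <stdbool.h>
#include <stdint.h>

#define PHT_BITS 12
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];   /* 2-bit saturating counters, values 0..3 */
static uint32_t ghr;             /* global history register, 1 bit per recent branch */

/* Predict taken if the counter is in one of the two "taken" states (2 or 3). */
static bool predict(uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);  /* gshare: PC XOR global history */
    return pht[idx] >= 2;
}

/* Train with the actual outcome, then shift the outcome into the history. */
static void update(uint32_t pc, bool taken) {
    uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = (ghr << 1) | (taken ? 1u : 0u);
}
</code>
The XOR keeps the index usable because it folds both the branch identity (PC) and the recent global outcome pattern into the same table entry without storing them separately.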
===== Lecture 11 (2/11 Wed.) =====
* Geometric GHR length for branch prediction * Perceptron branch predictor * Multi-cycle executions (Different functional units take different numbers of cycles) * Instructions can retire out-of-order * How to deal with this case? Stall? Throw exceptions if there are problems? * Exceptions and Interrupts * When are they handled? * Why should some interrupts be handled right away? * Precise exceptions * arch. state should be consistent before handling the exception/interrupt * Easier to debug (you see the sequential flow when the interrupt occurs) * Deterministic * Easier to recover from the exception * Easier to restart the process * How to ensure precise exceptions? * Tradeoffs between each method * Reorder buffer * Reorder results before they are visible to the arch. state * Need to preserve the sequential semantics and data * What information is in a ROB entry * Where to get the value from (forwarding path? reorder buffer?) * Extra logic to check where the youngest instruction/value is * Content addressable search (CAM) * A lot of comparators * Different ways to simplify the reorder buffer * Register renaming * Same register refers to independent values (lack of registers) * Where does the exception happen (after retire) * History buffer * Update the register file when the instruction completes. Unroll if there is an exception. * Future file (commonly used, along with the reorder buffer) * Keep two sets of register files * An updated value (speculative), called the future file * A backup value (to restore the state quickly) * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to the ROB alone) * Branch misprediction resembles an exception * The difference is that a branch misprediction is not visible to the software * Also much more common (say, divide by zero vs. a mispredicted branch) * Recovery is similar to exception handling * Latency of the state recovery * What to do during the state recovery * Checkpointing * Advantages?
===== Lecture 12 (2/13 Fri.) =====
* Renaming * Register renaming table (see the sketch at the end of this lecture's list) * Predictor (branch predictor, cache line predictor ...) * Power budget (and its importance) * Architectural state, precise state * Memory dependence is known dynamically * Register state is not shared across threads/processors * Memory state is shared across threads/processors * How to maintain speculative memory states * Write buffers (help simplify the process of checking the reorder buffer) * Overall OoO mechanism * What are other ways of eliminating dispatch stalls * Dispatch when the sources are ready * Retired instructions make the source available * Register renaming * Reservation stations * What goes into the reservation station * Tags required in the reservation station * Tomasulo's algorithm * Without precise exceptions, OoO is hard to debug * Arch. register ID * Examples in the slides * Slide 28 --> register renaming * Slides 30-35 --> Exercise (also on the board) * This will be useful for the midterm * Register alias table * Broadcasting tags * Using dataflow
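A minimal sketch of the register renaming idea from Lecture 12: a register alias table maps architectural registers to fresh physical registers/tags so that anti and output dependences disappear. The table sizes and the free-register handling are simplified assumptions, not Tomasulo's exact bookkeeping.
<code c>
#include <stdio.h>

#define NUM_ARCH 8     /* architectural registers (kept small for the example) */
#define NUM_PHYS 32    /* physical registers / tags */

static int rat[NUM_ARCH];          /* register alias table: arch reg -> physical reg */
static int next_free = NUM_ARCH;   /* real HW maintains a proper free list */

/* Rename one instruction "rd = rs1 op rs2": sources read the current mapping,
   the destination gets a fresh physical register. */
static void rename(int rd, int rs1, int rs2) {
    int p1 = rat[rs1], p2 = rat[rs2];
    int pd = next_free++ % NUM_PHYS;
    rat[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void) {
    for (int i = 0; i < NUM_ARCH; i++) rat[i] = i;   /* initial identity mapping */
    /* r1 = r2 op r3 ; r1 = r4 op r5  -- the output dependence on r1 disappears */
    rename(1, 2, 3);
    rename(1, 4, 5);
    return 0;
}
</code>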
===== Lecture 13 (2/16 Mon.) =====
* OoO --> Restricted Dataflow * Extracting parallelism * What are the bottlenecks? * Issue width * Dispatch width * Parallelism in the program * What does it mean to be restricted dataflow * Still visible as a Von Neumann model * Where does the efficiency come from? * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense? * Load/store handling * Would like to schedule them out of order, but make them visible in-order * When do you schedule the load/store instructions? * Can we predict if loads/stores are dependent? * This is one of the most complex structures in load/store handling * What information can be used to predict these load/store optimizations? * Centralized vs. distributed? What are the tradeoffs? * How to handle a misprediction/recovery * OoO + branch prediction? * Speculatively update the history register * When do you update the GHR? * Token dataflow arch. * What are tokens? * How to match tokens * Tagged token dataflow arch. * What are the tradeoffs? * Difficulties?
===== Lecture 14 (2/18 Wed.) =====
* SISD/SIMD/MISD/MIMD * Array processor * Vector processor * Data parallelism * Where does the concurrency arise? * Differences between an array processor and a vector processor * VLIW * Compactness of an array processor * A vector processor operates on a vector of data (rather than a single datum (scalar)) * Vector length (also applies to array processors) * No dependency within a vector --> can have a deep pipeline * Highly parallel (both instruction level (ILP) and memory level (MLP)) * But the program needs to be very parallel * Memory can be the bottleneck (due to very high MLP) * What do the functional units look like? Deep pipeline and simpler control. * CRAY-1 is one example of a vector processor * Memory access pattern in a vector processor * How do the memory accesses benefit the memory bandwidth? * Memory level parallelism * Stride length vs. the number of banks * stride length should be relatively prime to the number of banks * Tradeoffs between row major and column major --> How can the vector processor deal with the two? * How to calculate the efficiency and performance of vector processors * What if there are multiple memory ports? * Gather/scatter allows a vector processor to be a lot more programmable (i.e. gather data for parallelism) * Helps handle sparse matrices * Conditional operation * Structure of vector units * How to automatically parallelize code through the compiler? * This is a hard problem. The compiler does not know the memory addresses. * What do we need to ensure for both vector and array processors? * Sequential bottleneck * Amdahl's law * Intel MMX --> An example of Intel's approach to SIMD * No VLEN, use the opcode to define the length * Stride is one in MMX * Intel SSE --> Modern version of MMX
===== Lecture 15 (2/20 Fri.) =====
* GPU * Warp/Wavefront * A bunch of threads sharing the same PC * SIMT * Lanes * FGMT + massively parallel * Tolerate long latency * Warp based SIMD vs. traditional SIMD * SPMD (Programming model) * Single program operates on multiple data * can have synchronization points * Many scientific applications are programmed in this manner * Control flow problem (branch divergence) * Masking (in a branch, mask threads that should not execute that path; see the sketch at the end of this lecture's list) * Lower SIMD efficiency * What if you have layers of branches? * Dynamic warp formation * Combining threads from different warps to increase SIMD utilization * This can cause memory divergence * VLIW * Wide fetch * IA-64 * Tradeoffs * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) * A lot of the burden is placed on the compiler * Decoupled access/execute * Limited form of OoO * Tradeoffs * How to steer the instructions (determine dependency/stalling)? * Instruction scheduling techniques (static vs. dynamic) * Systolic arrays * Processing elements transform data in chains * Developed for image processing (for example, convolution) * Stage processing
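A small sketch of the masking bullet above: all lanes of a warp share one PC, so an if/else is handled by running both paths with complementary active masks. The warp width and the example condition are arbitrary assumptions.
<code c>
#include <stdint.h>
#include <stdio.h>

#define WARP 8  /* assumed warp width for the example */

int main(void) {
    int x[WARP] = {3, -1, 4, -5, 9, -2, 6, -7};
    int y[WARP] = {0};

    /* Build the active mask for the taken path of "if (x[i] < 0)". */
    uint8_t mask = 0;
    for (int lane = 0; lane < WARP; lane++)
        if (x[lane] < 0) mask |= 1u << lane;

    /* "Then" path: only lanes with their mask bit set execute. */
    for (int lane = 0; lane < WARP; lane++)
        if (mask & (1u << lane)) y[lane] = -x[lane];

    /* "Else" path: the complementary lanes execute. SIMD efficiency drops
       because each pass occupies the whole warp but only some lanes do work. */
    for (int lane = 0; lane < WARP; lane++)
        if (!(mask & (1u << lane))) y[lane] = x[lane];

    for (int lane = 0; lane < WARP; lane++) printf("%d ", y[lane]);
    printf("\n");
    return 0;
}
</code>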
===== Lecture 16 (2/23 Mon.) =====
* Systolic arrays * Processing elements transform data in chains * Can be arrays of multi-dimensional processing elements * Developed for image processing (for example, convolution) * Can be used to break stages in pipelined programs, using a set of queues and processing elements * Can enable high concurrency and is good for regular programs * Very special purpose * The WARP computer * Static instruction scheduling * How do we find the next instruction to execute? * Live-ins and live-outs * Basic blocks * Rearranging instructions in the basic block * Code movement from one basic block to another * Straight line code * Independent instructions * How to identify independent instructions * Atomicity * Trace scheduling * Side entrances * Fix-up code * How scheduling is done * Instruction scheduling * Prioritization heuristics * Superblock * Traces with no side entrances * Hyperblock * BS-ISA * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA
===== Lecture 17 (2/25 Wed.) =====
* IA-64 * EPIC * IA-64 instruction bundle * Multiple instructions in the bundle along with the template bits * Template bits * Stop bits * Non-faulting loads and exception propagation * Aggressive ST-LD reordering * Physical memory system * Ideal pipelines * Ideal cache * More capacity * Fast * Cheap * High bandwidth * DRAM cell * Cheap * Sense the perturbation through the sense amplifier * Slow and leaky * SRAM cell (cross-coupled inverters) * Expensive * Fast (easier to sense the value in the cell) * Memory bank * Read access sequence * DRAM: Activate -> Read -> Precharge (if needed) * What dominates the access latency for DRAM and SRAM * Scaling issue * Hard to scale the cell to smaller sizes * Memory hierarchy * Prefetching * Caching * Spatial and temporal locality * Caches can exploit these * Recently used data is likely to be accessed * Nearby data is likely to be accessed * Caching in a pipelined design * Cache management * Manual * Data movement is managed manually * Embedded processors * GPU scratchpad * Automatic * HW manages data movement * Latency analysis * Based on the hit and miss status, next level access time (if miss), and the current level access time * Cache basics * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc. * Cache access * How to access tag and data (in parallel vs. serially) * How do tag and index get used? (see the sketch at the end of this lecture's list) * Modern processors perform serial access for higher-level caches (L3 for example) to save power * Cost and benefit of having more associativity * Given the associativity, which block should be replaced if the set is full * Replacement policies * Random * Least recently used (LRU) * Least frequently used * Least costly to refetch * etc. * How to implement LRU * How to keep track of access ordering * Complexity increases rapidly * Approximate LRU * Victim and next victim policy
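To make the tag/index question above concrete, here is a sketch of splitting an address for a direct-mapped cache. The cache geometry (64-byte blocks, 256 sets) is an arbitrary assumption.
<code c>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64          /* assumed block (line) size  -> 6 offset bits */
#define NUM_SETS    256         /* assumed number of sets     -> 8 index bits  */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr % BLOCK_BYTES;                 /* byte within the block      */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;    /* selects the set            */
    uint32_t tag    = addr / BLOCK_BYTES / NUM_SETS;      /* compared against tag store */

    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}
</code>
With higher associativity the index selects a set of several blocks, and the tag is compared against every way in that set, which is where the extra cost of associativity comes from.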
===== Lecture 18 (2/27 Fri.) =====
* Tag store and data store * Cache hit rate * Average memory access time (AMAT) * AMAT vs. stall time * Cache basics * Direct mapped vs. associative cache * Set/block (line)/Placement/replacement * How do tag and index get used? * Full associativity * Set associative cache * insertion, promotion, eviction (replacement) * Various replacement policies * How to implement LRU * How to keep track of access ordering * Complexity increases rapidly * Approximate LRU * Victim and next victim policy * Set thrashing * Working set is bigger than the associativity * Belady's OPT * Is this optimal? * Complexity? * DRAM as a cache for disk * Handling writes * Write through * Simpler, all levels have up-to-date data * Write back * Needs a modified (dirty) bit to make sure accesses to the data get the updated data * Sectored caches * Use subblocks * Lower bandwidth requirement * More complex * Instruction vs. data caches * Where to place instructions * Unified vs. separated * In the first level cache * Cache access * First level access * Second level access * When to start the second level access * Cache performance * capacity * block size * associativity * Classification of cache misses
===== Lecture 19 (03/02 Mon.) =====
* Subblocks * Victim cache * Small, but fully assoc. cache behind the actual cache * Holds recently evicted (victim) cache blocks * Prevents ping-ponging * Pseudo associativity * Simpler way to implement an associative cache * Skewed assoc. cache * Different hashing functions for each way * Restructure data access patterns * Order of loop traversal * Blocking * Memory level parallelism * A parallel cache miss is less costly per miss than serial misses * MSHR * Keeps track of pending cache misses * Think of this as something like a load/store buffer for the cache * What information goes into the MSHR? * When do you access the MSHR? * Memory banks * Shared caches in multi-core processors
===== Lecture 20 (03/04 Wed.) =====
* Virtual vs. physical memory * System's management of memory * Benefits * Problem: physical memory has limited size * Mechanisms: indirection, virtual addresses, and translation * Demand paging * Physical memory as a cache * Tasks of system SW for VM * Serving a page fault * Address translation * Page table (see the sketch at the end of this lecture's list) * PTE (page table entry) * Page replacement algorithms * CLOCK algo. * Inverted page table * Page size trade-offs * Protection * Multi-level page tables * x86 implementation of the page table * TLB * Handling misses * When to do address translation? * Homonyms and Synonyms * Homonym: the same VA maps to different PAs in different processes * Synonyms: multiple VAs map to the same PA * Shared libraries, shared data, copy-on-write * Virtually indexed vs. physically indexed * Virtually tagged vs. physically tagged * Virtually indexed physically tagged * Can these create problems when we have caches? * How to eliminate these problems? * Page coloring * Interaction between the cache and the TLB
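A minimal sketch of the page-table bullets from Lecture 20: a virtual address is split into a virtual page number and an offset, and a page table maps VPN to PPN. The page size, table size, single-level flat table, and PTE layout are simplified assumptions (real x86 tables are multi-level, as the list notes).
<code c>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u         /* assumed 4 KB pages -> 12 offset bits */
#define NUM_VPAGES 1024u         /* toy, single-level page table */

typedef struct { uint32_t ppn; int valid; } pte_t;
static pte_t page_table[NUM_VPAGES];

/* Translate a virtual address; returns 0 and sets *pa on success,
   returns -1 on a page fault (invalid PTE). */
static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va / PAGE_SIZE;
    uint32_t offset = va % PAGE_SIZE;
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid) return -1;  /* page fault */
    *pa = page_table[vpn].ppn * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    page_table[3] = (pte_t){ .ppn = 42, .valid = 1 };
    uint32_t pa;
    if (translate(3 * PAGE_SIZE + 0x10, &pa) == 0)
        printf("PA = 0x%x\n", pa);
    return 0;
}
</code>
A TLB caches exactly these VPN-to-PPN translations; virtually indexed, physically tagged caches overlap the cache index lookup with this translation, which is the interaction the list refers to.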
===== Lecture 21 (03/23 Mon.) =====
* DRAM scaling problem * Demands/trends affecting the main memory * More capacity * Low energy * More bandwidth * QoS * ECC in DRAM * Multi-porting * Virtual multi-porting * Time-share the port, not too scalable but cheap * True multiporting * Multiple cache copies * Alignment * Banking * Can have bank conflicts * Extra interconnect across banks * Address mapping can mitigate bank conflicts * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity) * Bank mapping * How to avoid bank conflicts? * Channel mapping * Address mapping to minimize bank conflicts * Page coloring * Virtual to physical mapping that can help reduce conflicts * Accessing DRAM * Row bits * Column bits * Addressability * DRAM has its own clock * Sense amplifier * Bit lines * Word lines * DRAM (1T) vs. SRAM (6T) * Cost * Latency * Interleaving in DRAM * Effects of address mapping on memory interleaving * Effects of the program's memory access patterns on interleaving * DRAM Bank * To minimize the cost of interleaving (shares the data bus and the command bus) * DRAM Rank * Minimizes the cost per chip (a bundle of chips operated together) * DRAM Channel * An interface to DRAM, each with its own ranks/banks * DRAM Chip * DIMM * More DIMMs add interconnect complexity * List of commands to read/write data in DRAM * Activate -> read/write -> precharge * Activate moves data into the row buffer * Precharge prepares the bank for the next access * Row buffer hit * Row buffer conflict * Scheduling memory requests to lower row conflicts * Burst mode of DRAM * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits * Address mapping * Row interleaved * Cache block interleaved * Memory controller * Sending DRAM commands * Periodically send commands to refresh DRAM cells * Ensure correctness and data integrity * Where to place the memory controller * On the CPU chip vs. at the main memory * Higher BW on-chip * Determines the order of requests that will be serviced in DRAM * Request queues that hold requests * Send requests whenever the request can be sent to the bank * Determine which command (across banks) should be sent to DRAM
===== Lecture 22 (03/25 Wed.) =====
* Flash controller * Flash memory * Garbage collection in flash * Overhead in flash memory * Erase (off the critical path, but takes a long time) * Different types of DRAM * DRAM design choices * Cost/density/latency/BW/Yield * Sense Amplifier * How do they work * Double data rate * Subarray * RowClone * Moving bulk data from one row to another * Lower latency and bandwidth consumption when copying/zeroing out data * TL-DRAM * Far segment * Near segment * What causes the long latency * Benefit of TL-DRAM * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) * List of commands to read/write data in DRAM * Activate -> read/write -> precharge * Activate moves data into the row buffer * Precharge prepares the bank for the next access * Row buffer hit * Row buffer conflict * Scheduling memory requests to lower row conflicts * Burst mode of DRAM * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits * Address mapping (see the sketch at the end of this lecture's list) * Row interleaved * Cache block interleaved * Memory controller * Sending DRAM commands * Periodically send commands to refresh DRAM cells * Ensure correctness and data integrity * Where to place the memory controller * On the CPU chip vs. at the main memory * Higher BW on-chip * Determines the order of requests that will be serviced in DRAM * Request queues that hold requests * Send requests whenever the request can be sent to the bank * Determine which command (across banks) should be sent to DRAM * Priority of demand vs. prefetch requests * Memory scheduling policies * FCFS * FR-FCFS * Try to maximize the row buffer hit rate * Capped FR-FCFS: FR-FCFS with a timeout * Usually this is done at the command level (read/write commands and precharge/activate commands) * PAR-BS * Key benefits * stall time * shortest job first * STFM * ATLAS * TCM * Key benefits * Configurability * Fairness + performance at the same time * Robustness issues * Open row policy * Closed row policy * QoS * QoS issues in memory scheduling * Fairness * Performance guarantees
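A sketch of the two address-mapping options mentioned above (row interleaved vs. cache-block interleaved), showing which bank a physical address falls into under each scheme. The bank count, row size, and block size are arbitrary assumptions, as is the simple modulo mapping.
<code c>
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS   8u
#define ROW_BYTES   8192u   /* assumed row (row-buffer) size  */
#define BLOCK_BYTES 64u     /* assumed cache block size       */

/* Row interleaved: consecutive rows go to consecutive banks.
   A stream within one row keeps hitting the same open row buffer. */
static unsigned bank_row_interleaved(uint64_t pa) {
    return (pa / ROW_BYTES) % NUM_BANKS;
}

/* Cache-block interleaved: consecutive cache blocks go to consecutive banks.
   A stream is spread across banks (more bank-level parallelism). */
static unsigned bank_block_interleaved(uint64_t pa) {
    return (pa / BLOCK_BYTES) % NUM_BANKS;
}

int main(void) {
    for (uint64_t pa = 0; pa < 4 * BLOCK_BYTES; pa += BLOCK_BYTES)
        printf("pa=%4llu  row-interleaved bank=%u  block-interleaved bank=%u\n",
               (unsigned long long)pa,
               bank_row_interleaved(pa), bank_block_interleaved(pa));
    return 0;
}
</code>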
===== Lecture 23 (03/27 Fri.) =====
* Different ways to control interference in DRAM * Partitioning of resources * Channel partitioning: map applications that interfere with each other to different channels * Keep track of applications' characteristics * Dedicating a channel might waste bandwidth * Need OS support to determine the channel bits * Source throttling * The controller throttles cores depending on the performance target * Example: Fairness via source throttling * Detect unfairness and throttle the application that is interfering * How do you estimate slowdown? * Threshold based solution: hard to configure * App/thread scheduling * Critical threads usually stall the progress * Designing a DRAM controller * Has to handle the normal DRAM operations * Read/write/refresh/all the timing constraints * Keep track of resources * Assign priorities to different requests * Manage requests to banks * Self-optimizing controller * Use machine learning to improve the DRAM controller * A-DRM * Architecture-aware DRAM * Multithreading * synchronization * Pipelined programs * Producer-consumer model * Critical path * Limiter threads * Prioritization between threads * Different power modes in DRAM * DRAM Refresh * Why does DRAM have to be refreshed every 64 ms? (see the sketch at the end of this lecture's list) * Banks are unavailable during refresh * LPDDR mitigates this by using per-bank refresh * Refresh takes longer with bigger DRAM * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner * As opposed to burst refresh (long pause time) * RAIDR: Reduce DRAM refresh by profiling and binning * Some rows do not have to be refreshed very frequently * Profile the rows * High temperature changes the retention time: need online profiling * Bloom filter * Represents set membership * Approximate * Can contain false positives * Better/more hash functions help reduce this
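A back-of-the-envelope sketch for the refresh bullets above: the fraction of time a device is unavailable is roughly (refresh operations per 64 ms window x busy time per refresh operation) / 64 ms. The operation counts and latencies below are made-up example values, used only to show how the overhead grows with capacity.
<code c>
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only -- not taken from the lecture slides. */
    double window_us        = 64000.0;  /* every row must be refreshed within 64 ms   */
    double refresh_ops      = 8192.0;   /* assumed refresh commands issued per window */
    double t_refresh_small  = 0.35;     /* assumed busy time per command (us), small device */
    double t_refresh_big    = 0.9;      /* assumed busy time per command (us), denser device */

    printf("small device: %.1f%% of time spent refreshing\n",
           100.0 * refresh_ops * t_refresh_small / window_us);
    printf("denser device: %.1f%% of time spent refreshing\n",
           100.0 * refresh_ops * t_refresh_big / window_us);
    return 0;
}
</code>
This is the overhead RAIDR attacks by refreshing the few weak (short-retention) rows often and everything else less often.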
===== Lecture 24 (03/30 Mon.) =====
* Simulation * Drawbacks of RTL simulation * Time consuming * Complex to develop * Hard to perform design exploration * Explore the design space quickly * Match the behavior of existing systems * Tradeoffs: speed, accuracy, flexibility * High-level simulation vs. detailed simulation * High-level simulation is faster, but lower accuracy * Controllers that work with multiple types of cores * Design problem: how to find a good scheduling policy on its own? * Self-optimizing memory controller: using machine learning * Can adapt to the applications * The complexity is very high * Tolerating latency can be costly * The instruction window is complex * The benefit also diminishes * Designing the buffers can be complex * A simpler way to tolerate out-of-order is desirable * Different sources that cause the core to stall in OoO * Cache misses * Note that a stall happens if the inst. window is full * Scaling the instruction window size is hard * It is better (less complex) to make the window more efficient * Runahead execution * Try to obtain MLP w/o increasing the instruction window * Runahead (i.e. execute ahead) when there is a long memory instruction * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time * Execute future instructions to generate accurate prefetches * Allow future data to be in the cache * How to support runahead execution? * Need a way to checkpoint the state when entering runahead mode * How to make executing down the wrong path useful? * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
===== Lecture 25 (4/1 Wed.) =====
* More on runahead execution * How to support runahead execution? * Need a way to checkpoint the state when entering runahead mode * How to make executing down the wrong path useful? * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative) * Cost and benefit of runahead execution (slide number 27) * Runahead can have inefficiencies * Runahead periods that are useless * Get rid of useless/inefficient periods * What if there is a dependent cache miss? * Cannot be parallelized in vanilla runahead * Can predict the value of the dependent load * How to predict the address of the load * Delta value information * Stride predictor * AVD prediction * Questions regarding prefetching * What to prefetch * When to prefetch * how do we prefetch * where to prefetch from * Prefetching can cause thrashing (evict a useful block) * Prefetching can also be useless (not being used) * Needs to be efficient * Can cause memory bandwidth problems in GPUs * Prefetch the whole block, more than one block, or a subblock? * Each one of them has pros and cons * A big prefetch is more likely to waste bandwidth * Commonly done at a cache block granularity * Prefetch accuracy: fraction of useful prefetches out of all prefetches * Prefetchers usually predict based on * Past knowledge * Compiler hints * The prefetcher has to prefetch at the right time * A prefetch that is too early might get evicted * It might also evict other useful data * A prefetch that is too late does not hide the whole memory latency * Previous prefetches at the same PC can be used as the history * Previous demand requests are also good information to use for prefetches * Prefetch buffer * Place the prefetched data there to avoid thrashing * Can treat demand/prefetch requests separately * More complex * Generally, a demand block is more important * This means eviction should prefer prefetch blocks as opposed to demand blocks * Tradeoffs between where we place the prefetcher * Look at L1 hits and misses * Look at L1 misses only * Look at L2 misses * Different access patterns affect accuracy * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses) * Software vs. hardware vs. execution based prefetching * Software: the ISA provides prefetch instructions, software utilizes them * What information is useful * How to make sure the prefetch is timely * What if you have a pointer based structure * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one timely enough) * Can be solved by hinting the next-next and/or next-next-next address * Hardware: Identify the pattern and prefetch * Execution driven: Opportunistically try to prefetch (runahead, dual-core execution) * Stride prefetcher (see the sketch at the end of this lecture's list) * Predicts strides, which are common in many programs * Cache block based or instruction based * Stream buffer design * Buffer the stream of accesses (next addresses) * Use the information to prefetch * What affects prefetcher performance * Prefetch distance * How far ahead should we prefetch * Prefetch degree * How many prefetches do we issue * Prefetcher performance * Coverage * Out of all the demand misses, how many were covered by prefetches * Accuracy * Out of all the prefetch requests, how many are actually used * Timeliness * How much memory latency can we hide with the prefetch requests * Cache pollution * How many demand misses did the prefetcher cause (by evicting useful data)? * Hard to quantify
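A minimal sketch of the stride prefetcher bullet above: a small table indexed by load PC remembers the last address and last stride, and issues a prefetch when the same stride repeats. The table size, the prefetch distance of one, and the prefetch hook are simplified assumptions.
<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64  /* assumed number of tracked load PCs */

typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
} stride_entry_t;

static stride_entry_t table[TABLE_SIZE];

static void issue_prefetch(uint64_t addr) {        /* stand-in for a real prefetch request */
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every (load PC, accessed address) pair. */
static void stride_prefetcher(uint64_t pc, uint64_t addr) {
    stride_entry_t *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->last_stride)
        issue_prefetch(addr + (uint64_t)stride);   /* prefetch distance of 1 for simplicity */
    e->last_stride = stride;
    e->last_addr   = addr;
}

int main(void) {
    for (uint64_t a = 0x1000; a < 0x1000 + 8 * 64; a += 64)
        stride_prefetcher(0x400100, a);            /* a streaming load with a 64-byte stride */
    return 0;
}
</code>
Larger prefetch distance and degree would hide more latency at the cost of accuracy and bandwidth, which is exactly the distance/degree tradeoff in the list.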
===== Lecture 26 (4/3 Fri.) =====
* Feedback directed prefetcher * Use the results of the prefetcher as feedback to the prefetcher * With accuracy, timeliness, and pollution information * Markov prefetcher (see the sketch at the end of this lecture's list) * Prefetch based on previous history * Use a Markov model to predict * Pros: can cover arbitrary patterns (easy for linked list traversals or trees) * Downside: high cost, cannot help with compulsory misses (no history) * Content directed prefetching * Identify pointers in the contents of memory (used as the address to prefetch) * Not very efficient (hard to figure out which part of the block is a pointer) * Software can give hints * Correlation table * Address correlation * Execution based prefetchers * Helper thread/speculative thread * Use another thread to pre-execute a program * Can be software based or hardware based * Discover misses before the main program (to prefetch data in a timely manner) * How do you construct the helper thread? * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9 * Thread-based pre-execution * Error tolerance * Solutions to errors * Tolerate errors * New interfaces, new designs * Eliminate or minimize errors * New technology, system-wide rethinking * Embrace errors * Map data that can tolerate errors to error-prone areas * Hybrid memory systems * Combining multiple memory technologies together * What can emerging technologies help with? * Scalability * Lower cost * Energy efficiency * Possible solutions to the scaling problem * Lower-leakage DRAM * Heterogeneous DRAM (TL-DRAM, etc.) * Add more functionality to DRAM * Denser designs (3D stacking) * Different technologies * NVM * Charge vs. resistive memory * How is data written? * How is data read? * Non-volatile memory * Resistive memory * PCM * Inject current to change the phase * Scales better than DRAM * Multiple bits per cell * Wider resistance range * No refresh is needed * Downsides: latency and write endurance * STT-MRAM * Inject current to change the polarity * Memristor * Inject current to change the structure * Pros and cons between the different technologies * Persistence - data stays there even without power * Unified memory and storage management (persistent data structures) - Single level store * Improve energy and performance * Simplify the programming model * Different design options for DRAM + NVM * DRAM as a cache * Place some data in DRAM and other data in PCM * Based on the characteristics * Frequently accessed data that needs lower write latency goes in DRAM
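A small sketch of the Markov (address-correlation) prefetcher above: a correlation table remembers which miss address tended to follow each miss address and prefetches that successor the next time around. Keeping a single successor per entry in a direct-mapped table is a simplifying assumption; real proposals keep several successors with probabilities.
<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256  /* assumed correlation-table size */

typedef struct { uint64_t miss_addr; uint64_t next_addr; int valid; } corr_entry_t;
static corr_entry_t table[TABLE_SIZE];
static uint64_t prev_miss;
static int have_prev;

static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every cache miss address. */
static void markov_prefetcher(uint64_t miss) {
    /* 1. Record that "prev_miss was followed by miss". */
    if (have_prev) {
        corr_entry_t *rec = &table[(prev_miss >> 6) % TABLE_SIZE];
        rec->miss_addr = prev_miss;
        rec->next_addr = miss;
        rec->valid = 1;
    }
    /* 2. If this miss has been seen before, prefetch its recorded successor. */
    corr_entry_t *e = &table[(miss >> 6) % TABLE_SIZE];
    if (e->valid && e->miss_addr == miss) issue_prefetch(e->next_addr);

    prev_miss = miss;
    have_prev = 1;
}

int main(void) {
    /* A repeating pointer-chasing-like miss sequence: A, B, C, A, B, C ... */
    uint64_t seq[] = {0x1000, 0x8040, 0x3f80, 0x1000, 0x8040, 0x3f80};
    for (int i = 0; i < 6; i++) markov_prefetcher(seq[i]);
    return 0;
}
</code>
This covers arbitrary repeating patterns (unlike a stride prefetcher), but the table is expensive and it can never help a compulsory miss, matching the pros/cons in the list.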
===== Lecture 27 (4/6 Mon.) =====
* Flynn's taxonomy * Parallelism * Reduces power consumption (P ~ CV^2F) * Better cost efficiency and easier to scale * Improves dependability (in case another core is faulty) * Different types of parallelism * Instruction level parallelism * Data level parallelism * Task level parallelism * Task level parallelism * Partition a single, potentially big, task into multiple parallel sub-tasks * Can be done explicitly (parallel programming by the programmer) * Or implicitly (hardware partitions a single thread speculatively) * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement) * Loosely coupled multiprocessors * No shared global address space * Message passing to communicate between the different processors * Simple to manage memory * Tightly coupled multiprocessors * Shared global address space * Need to ensure consistency of data * Programming issues * Hardware-based multithreading * Coarse grained * Fine grained * Simultaneous: dispatch instructions from multiple threads at the same time * Parallel speedup * Superlinear speedup * Utilization, Redundancy, Efficiency * Amdahl's law * Maximum speedup * Parallel portion is not perfect * Serial bottleneck * Synchronization cost * Load imbalance * Some threads have more work and require more time to hit the sync. point * Critical sections * Enforce mutually exclusive access to shared data * Issues in parallel programming * Correctness * Synchronization * Consistency
===== Lecture 28 (4/8 Wed.) =====
* Ordering of instructions * Maintaining memory consistency when there are multiple threads and shared memory * Need to ensure the semantics are not changed * Making sure the shared data is properly locked when used * Support mutual exclusion * Ordering depends on when each processor executes * Debugging is also difficult (non-deterministic behavior) * Dekker's algorithm (see the sketch at the end of this lecture's list) * Inconsistency -- the two processors did NOT see the same order of operations to memory * Sequential consistency * Multiple correct global orders * Two issues: * Too conservative/strict * Performance limiting * Weak consistency: global ordering only at synchronization points * The programmer hints where the synchronizations are * Memory fence * More burden on the programmer * Cache coherence * Can be done at the software level or the hardware level * Snoop-based coherence * A simple protocol with two states by broadcasting reads/writes on a bus * Maintaining coherence * Needs to provide 1) write propagation and 2) write serialization * Update vs. Invalidate * Two cache coherence methods * Snoopy bus * Bus based, single point of serialization * More efficient with a small number of processors * Processors snoop other caches' read/write requests to keep the cache block coherent * Directory * Single point of serialization per block * The directory coordinates the coherence * More scalable * The directory keeps track of where the copies of each block reside * Supplies data on a read * Invalidates the block on a write * Has an exclusive state
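A sketch of Dekker's algorithm referenced above, for two threads. It is only correct under sequential consistency; on real hardware with weaker consistency the flag/turn accesses need fences, which is exactly the point the lecture makes. The sketch assumes a toolchain with C11 threads and atomics, and uses sequentially consistent atomics to stand in for those fences; the counter example is purely illustrative.
<code c>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static atomic_bool want[2];
static atomic_int  turn;
static long counter;                 /* shared data protected by the critical section */

static void lock(int me) {
    int other = 1 - me;
    atomic_store(&want[me], true);
    while (atomic_load(&want[other])) {        /* the other thread also wants in   */
        if (atomic_load(&turn) != me) {        /* not our turn: back off and wait  */
            atomic_store(&want[me], false);
            while (atomic_load(&turn) != me) { /* spin */ }
            atomic_store(&want[me], true);
        }
    }
}

static void unlock(int me) {
    atomic_store(&turn, 1 - me);
    atomic_store(&want[me], false);
}

static int worker(void *arg) {
    int me = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        lock(me);
        counter++;                   /* mutually exclusive under sequential consistency */
        unlock(me);
    }
    return 0;
}

int main(void) {
    thrd_t t0, t1;
    int id0 = 0, id1 = 1;
    thrd_create(&t0, worker, &id0);
    thrd_create(&t1, worker, &id1);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    printf("counter = %ld (expect 200000)\n", counter);
    return 0;
}
</code>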
===== Lecture 29 (4/10 Fri.) =====
* MSI coherence protocol * The problem: unnecessary broadcasts of invalidations * MESI coherence protocol * Adds the exclusive state (this is the only cache copy and it is clean) to MSI * Multiple invalidation tradeoffs * Problem: memory can be unnecessarily updated * A possible owner state (MOESI) * Tradeoffs between snooping and directory based coherence protocols * Slide 31 has a good summary * Directory: data structures * Bit vectors vs. linked lists * Scalability of directories * Size? Latency? Thousands of nodes? Best of both snooping and directory?
===== Lecture 30 (4/13 Mon.) =====
* In-memory computing * Design goals of DRAM * DRAM structures * Banks * Capacitors and sense amplifiers * Trade-offs b/w the number of sense amps and cells * Width of bank I/O vs. row size * DRAM operations * ACTIVATE, READ/WRITE, and PRECHARGE * Trade-offs * Latency * Bandwidth: Chip vs. rank vs. bank * What's the benefit of having 8 chips? * Parallelism * RowClone * What are the problems? * Copying b/w two rows that share the same sense amplifiers * System software support * Bitwise AND/OR
===== Lecture 31 (4/15 Wed.) =====
* Application slowdown * Interference between different applications * An application's performance depends on the other applications it is running with * Predictable performance * Why is it important? * Applications that need predictability * How to predict the performance? * What information is useful? * What needs to be guaranteed? * How to estimate the performance when running with others? * Easy, just measure the performance while it is running. * How to estimate the performance when the application is running by itself? * Hard if there is no profiling. * The relationship between memory service rate and performance. * Key assumption: applications are memory bound * Behavior of memory-bound applications * With and without interference * Memory phase vs. compute phase * MISE * Estimating slowdown using request service rates * Inaccuracy when measuring the request service rate alone * Non-memory-bound applications * Control slowdown and provide soft guarantees * Taking the shared cache into account * MISE model + cache resource management * Auxiliary tag store * Separate tag stores for different cores * Cache access rate alone and shared as the metrics to estimate slowdown * Cache partitioning * How to determine the partitioning * Utility based cache partitioning * Others * Maximum slowdown and fairness metrics
===== Lecture 32 (4/20 Mon.) =====
* Heterogeneous systems * Asymmetric cores: different types of cores on the chip * Each of these cores is optimized for different workloads/requirements/goals * Multiple special purpose processors * Flexible and can adapt to workload behavior * Disadvantages: complex and high overhead * Examples: CPU-GPU systems, heterogeneity in execution models * Heterogeneous resources * Example: reliable and non-reliable DRAM in the same system * Key problems in modern systems * Memory system * Efficiency * Predictability * Asymmetric design can help solve these problems * Serialized code sections * Bottleneck in multicore execution * Parallelizable vs. serial portion * Accelerate critical sections * Cache ping-ponging * Synchronization latency * Symmetric vs. asymmetric design * Large cores + small cores * Core asymmetry * Amdahl's law with heterogeneous cores (see the sketch at the end of this lecture's list) * Parallel bottlenecks * Resource contention * Depends on what is running * Accelerated critical sections * Ship critical sections to large cores * Small modifications and low overhead * False serialization might become the bottleneck * Can reduce parallel throughput * Effect on private cache misses and shared cache misses
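For the "Amdahl's law with heterogeneous cores" bullet above, a small worked sketch in the spirit of the commonly cited symmetric-vs-asymmetric speedup model (Hill and Marty, "Amdahl's Law in the Multicore Era"). The chip budget in base-core equivalents, the perf(r) = sqrt(r) assumption for a big core built from r units, and the parallel fraction are all illustrative assumptions.
<code c>
#include <math.h>
#include <stdio.h>

/* Assumed performance of one big core built from r base-core resources. */
static double perf(double r) { return sqrt(r); }

/* Symmetric design: n/r identical cores, each of size r. */
static double speedup_symmetric(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

/* Asymmetric design: one big core of size r plus (n - r) base cores.
   The serial part runs on the big core; the parallel part uses everything. */
static double speedup_asymmetric(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
}

int main(void) {
    double f = 0.9;    /* assumed parallel fraction            */
    double n = 64.0;   /* assumed budget in base-core equivalents */
    for (double r = 1; r <= 16; r *= 2)
        printf("r=%2.0f  symmetric=%6.2f  asymmetric=%6.2f\n",
               r, speedup_symmetric(f, n, r), speedup_asymmetric(f, n, r));
    return 0;
}
</code>
The asymmetric design wins because the one big core attacks the serial bottleneck without giving up most of the parallel throughput, which is the motivation behind accelerated critical sections.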
===== Lecture 33 (4/27 Mon.) =====
* Interconnects * Connecting multiple components together * Goals: Scalability, flexibility, performance, and energy efficiency * Metrics: Performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency * Saturation point * Saturation throughput * Topology * How to wire components together; affects routing, throughput, latency * Bus: All nodes connected to a single shared bus * Hard to increase frequency and bandwidth, poor scalability, but simple * Point-to-point * Low contention and potentially low latency. Costly, not scalable, and hard to wire. * Crossbar * No contention. Requests between different src/dest pairs can be serviced concurrently. Costly. * Multistage logarithmic network * Indirect network, low contention, multiple requests can be serviced concurrently. More scalable compared to a crossbar. * Circuit switching * Omega network, delta network. * Butterfly network * Intermediate switches between sources and destinations * Switching vs. topology * Ring * Each node connected to two other nodes, forming a ring * Low overhead, high latency, not as scalable. * Unidirectional ring and bi-directional ring * Hierarchical Rings * Layers of rings. More scalable, lower latency. * Bridge routers connect multiple rings together * Mesh * 4 input and output ports * More bisection bandwidth and more scalable * Easy to lay out * Path diversity * Routers are more complex * Tree * Another hierarchical topology * Specialized topology * Good for local traffic * Fat tree: higher levels have more bandwidth * CM-5 Fat tree * Fat tree with 4x2 switches * Hypercube * N-dimensional cubes * Caltech Cosmic Cube * Very complex * Routing algorithms * How does a message get sent from source to destination * Static or adaptive * Handling contention * Buffering helps handle contention, but adds complexity * Three types of routing algorithms * Deterministic: always takes the same path (see the sketch at the end of this list) * Oblivious: takes different paths without taking the state of the network into account * For example, Valiant's algorithm * Adaptive: takes different paths, taking the state of the network into account * Non-minimal adaptive routing vs. minimal adaptive routing * Minimal path: a path that has the minimum number of hops * Buffering and flow control * How to store packets within the network * Handling oversubscription * Source throttling * Bufferless vs. buffered crossbars * Buffer overflow * Bufferless deflection routing * Deflect packets when there is contention * Hot-potato routing
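A sketch of deterministic dimension-order (XY) routing on a 2D mesh, as an example of the "always takes the same path" routing class above: fully correct the X coordinate first, then Y. The mesh size and coordinate encoding are arbitrary assumptions.
<code c>
#include <stdio.h>

typedef struct { int x, y; } node_t;

/* Next hop from cur toward dst under XY (dimension-order) routing:
   correct X first, then Y. Deterministic, and deadlock-free on a mesh
   because a packet never turns from the Y dimension back into X. */
static node_t xy_next_hop(node_t cur, node_t dst) {
    if (cur.x < dst.x) cur.x++;
    else if (cur.x > dst.x) cur.x--;
    else if (cur.y < dst.y) cur.y++;
    else if (cur.y > dst.y) cur.y--;
    return cur;   /* unchanged if cur == dst */
}

int main(void) {
    node_t cur = {0, 3}, dst = {2, 1};
    while (cur.x != dst.x || cur.y != dst.y) {
        cur = xy_next_hop(cur, dst);
        printf("-> (%d,%d)\n", cur.x, cur.y);
    }
    return 0;
}
</code>
An oblivious scheme like Valiant's instead routes through a random intermediate node, and an adaptive scheme would pick among minimal next hops based on congestion.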