====== Buzzwords ======
Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.
===== Lecture 1 (1/12 Mon.) =====
* Level of transformation * Algorithm * System software * Compiler * Cross abstraction layers * Tradeoffs * Caches * DRAM/memory controller * DRAM banks * Row buffer hit/miss * Row buffer locality * Unfairness * Memory performance hog * Shared DRAM memory system * Streaming access vs. random access * Memory scheduling policies * Scheduling priority * Retention time of DRAM * Process variation * Retention time profile * Power consumption * Bloom filter (see the sketch after this list) * Hamming code * Hamming distance * DRAM row hammer
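As a companion to the Bloom filter buzzword above, here is a minimal sketch of the idea: a bit array plus a few hash functions gives approximate set membership with possible false positives but no false negatives. The array size and the two hash functions below are illustrative assumptions, not anything given in lecture.
<code c>
#include <stdint.h>

#define BLOOM_BITS 1024  /* illustrative bit-array size */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Two simple illustrative hash functions; real designs pick better ones. */
static unsigned h1(uint64_t key) { return (key * 2654435761u) % BLOOM_BITS; }
static unsigned h2(uint64_t key) { return ((key ^ (key >> 13)) * 40503u) % BLOOM_BITS; }

static void bloom_insert(bloom_t *b, uint64_t key) {
    unsigned a = h1(key), c = h2(key);
    b->bits[a / 8] |= 1u << (a % 8);
    b->bits[c / 8] |= 1u << (c % 8);
}

/* Returns 1 = "possibly in the set" (may be a false positive),
   0 = "definitely not in the set" (no false negatives). */
static int bloom_query(const bloom_t *b, uint64_t key) {
    unsigned a = h1(key), c = h2(key);
    return ((b->bits[a / 8] >> (a % 8)) & 1) &&
           ((b->bits[c / 8] >> (c % 8)) & 1);
}
</code>
Using more hash functions and a larger bit array lowers the false-positive rate, the same tradeoff that comes back with RAIDR's retention-time bins in Lecture 23.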
===== Lecture 2 (1/14 Wed.) =====
* Moore's Law * Algorithm --> step-by-step procedure to solve a problem * in-order execution * out-of-order execution * technologies that are available on cellphones * new applications that are made available through new computer architecture techniques * more data mining (genomics/medical areas) * lower power (cellphones) * smaller cores (cellphones/computers) * etc. * Performance bottlenecks in single-threaded/single-core processors * multi-core as an alternative * Memory wall (a part of the scaling issue) * Scaling issue * Transistors are getting smaller * Key components of a computer * Design points * Design processors to meet the design points * Software stack * Design decisions * Datacenters * Reliability problems that cause errors * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) * Pre-paradigm science * Normal science * Revolutionary science * Components of a computer * Computation * Communication * Storage * DRAM * NVRAM (Non-volatile memory): PCM, STT-MRAM * Storage (Flash/Harddrive) * Von Neumann Model (Control flow model) * Stored program computer * Properties of Von Neumann Model: Stored program, sequential instruction processing * Unified memory * When is an instruction interpreted as an instruction (as opposed to a datum)? * Program counter * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS * Data flow model * Data flow machine * Data flow graph * Operands * Live-outs/Live-ins * Different types of data flow nodes (conditional/relational/barrier) * How to do transactions in dataflow? * Example: bank transactions * Tradeoffs between control-driven and data-driven * Which is easier to program? * Which is easier to compile? * Which is more parallel (does that mean it is faster?) * Which machines are more complex to design? * In control flow, when a program is stopped, there is a pointer to the current state (precise state). * ISA vs. Microarchitecture * Semantics in the ISA * uArch should obey the ISA * Changing the ISA is costly, can affect compatibility. * Instruction pointers * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level * Conceptual techniques * Pipelining * Multiple instructions at a time * Out-of-order execution * etc. * Design techniques * Adder implementation (Bit serial, ripple carry, carry lookahead) * Connection Machine (an example of a machine that uses bit-serial adders to trade off latency for more parallelism) * Microprocessor: ISA + uArch + circuits * What is part of the ISA? Instructions, memory, etc. * Things that are visible to the programmer/software * What is not part of the ISA? (what goes inside: uArch techniques) * Things that are not supposed to be visible to the programmer/software but typically make the processor faster and/or consume less power and/or make it less complex
===== Lecture 3 (1/16 Fri.) =====
* Microarchitecture * Three major tradeoffs of computer architecture * Macro-architecture * LC-3b ISA * Unused instructions * Bit steering * Instruction processing style * 0,1,2,3 address machines * Stack machine * Accumulator machine * 2-operand machine * 3-operand machine * Tradeoffs between 0,1,2,3 address machines * Postfix notation * Instructions/Opcode/Operand specifiers (i.e. addressing modes) * Simple vs. complex data types (and their tradeoffs) * Semantic gap and level * Translation layer * Addressability * Byte/bit addressable machines * Virtual memory * Big/little endian * Benefits of having registers (data locality) * Programmer visible (Architectural) state * Programmers can access this directly * What are the benefits? * Microarchitectural state * Programmers cannot access this directly * Evolution of registers (from accumulators to registers) * Different types of instructions * Control instructions * Data instructions * Operation instructions * Addressing modes * Tradeoffs (complexity, flexibility, etc.) * Orthogonal ISA * Addressing modes that are orthogonal to instruction types * I/O devices * Vectored vs. non-vectored interrupts * Complex vs. simple instructions * Tradeoffs * RISC vs. CISC * Tradeoffs * Backward compatibility * Performance * Optimization opportunity * Translation
===== Lecture 4 (1/21 Wed.) =====
* Fixed vs. variable length instructions * Huffman encoding * Uniform vs. non-uniform decode * Registers * Tradeoffs in the number of registers * Alignment * How does MIPS load words across the alignment boundary?
===== Lecture 5 (1/26 Mon.) =====
* Tradeoffs in ISA: Instruction length * Uniform vs. non-uniform * Design point/Use cases * What dictates the design point? * Architectural states * uArch * How to implement the ISA in the uArch * Different stages in the uArch * Clock cycles * Multi-cycle machine * Datapath and control logic * Control signals * Execution time of instructions/programs (see the sketch at the end of this lecture's list) * Metrics and what they mean * Instruction processing * Fetch * Decode * Execute * Memory fetch * Writeback * Encoding and semantics * Different types of instructions (I-type, R-type, etc.) * Control flow instructions * Non-control flow instructions * Delayed slot/Delayed branch * Single cycle control logic * Lockstep * Critical path analysis * Critical path of a single cycle processor * What is in the control signals? * Combinational logic & Sequential logic * Control store * Tradeoffs of a single cycle uarch * Design principles * Common case design * Critical path design * Balanced designs * Dynamic power/Static power * Increases in power due to frequency
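The execution-time bullets above reduce to the basic performance equation: execution time = instruction count x CPI x clock period. A toy calculation with purely made-up numbers (not from the lecture) showing why a single-cycle machine's clock is limited by its critical path:
<code c>
#include <stdio.h>

int main(void) {
    /* All numbers below are illustrative assumptions, not lecture data. */
    double insts      = 1e9;   /* dynamic instruction count */
    double cpi_single = 1.0;   /* single-cycle machine: CPI = 1, but ...            */
    double clk_single = 5e-9;  /* ... the cycle must fit the longest instruction
                                   (critical path), so the clock period is long     */
    double cpi_multi  = 4.2;   /* multi-cycle machine: more cycles per instruction, */
    double clk_multi  = 1e-9;  /* ... but each cycle only covers one stage          */

    printf("single-cycle: %.3f s\n", insts * cpi_single * clk_single);
    printf("multi-cycle : %.3f s\n", insts * cpi_multi  * clk_multi);
    return 0;
}
</code>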
===== Lecture 6 (1/28 Wed.) =====
* Design principles * Common case design * Critical path design * Balanced designs * Multi-cycle design * Microcoded/Microprogrammed machines * States * Translation from one state to another * Microinstructions * Microsequencing * Control store - Produces control signals * Microsequencer * Control signals * What do they have to control? * Instruction processing cycle * Latch signals * State machine * State variables * Condition codes * Steering bits * Branch enable logic * Difference between gating and loading? (write enable vs. driving the bus) * Memory mapped I/O * Hardwired logic * What control signals come from hardwired logic? * Variable latency memory * Handling interrupts * Difference between interrupts and exceptions * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA) * Updating machine behavior * Horizontal microcode * Vertical microcode * Primitives
===== Lecture 7 (1/30 Fri.) =====
* Emulator (i.e. uCode allows a minimal datapath to emulate the ISA) * Updating machine behavior * Horizontal microcode * Vertical microcode * Primitives * nanocode and millicode * what are the differences between nano/milli/microcode * microprogrammed vs. hardwired control * Pipelining * Limitations of the multi-cycle design * Idle resources * Throughput of a pipelined design * What dictates the throughput of a pipelined design? * Latency of the pipelined design * Dependency * Overhead of pipelining * Latch cost? * Data forwarding/bypassing * What is the ideal pipeline? * External fragmentation * Issues in pipeline designs * Stalling * Dependency (Hazard) * Flow dependence * Output dependence * Anti dependence * How to handle them? * Resource contention * Keeping the pipeline full * Handling exceptions/interrupts * Pipeline flush * Speculation
===== Lecture 8 (2/2 Mon.) =====
* Interlocking * Multipath execution * Fine grain multithreading * No-op (Bubbles in the pipeline) * Valid bits in the instructions * Branch prediction * Different types of data dependence * Pipeline stalls * bubbles * How to handle stalls * Stall conditions * Stall signals * Dependences * Distance between dependences * Data forwarding/bypassing (see the sketch at the end of this lecture's list) * Maintaining the correct dataflow * Different ways to design data forwarding paths/logic * Different techniques to handle interlocking * SW based * HW based * Profiling * Static profiling * Help from the software (compiler) * Superblock optimization * Analyzing basic blocks * How to deal with branches? * Branch prediction * Delayed branching (branch delay slot) * Forward control flow/backward control flow * Branch prediction accuracy * Profile guided code positioning * Position the code based on the profile info. * Try to make the next sequential instruction be the next inst. to be executed * Predicate combining (combine predicates for a branch instruction) * Predicated execution (control dependence becomes data dependence)
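To make the forwarding/stall bullets above concrete, here is a small sketch of the checks a pipeline might perform between stages. The five-stage structure, field names, and the load-use stall rule are generic textbook assumptions, not the exact datapath used in lecture.
<code c>
#include <stdbool.h>
#include <stdio.h>

/* A generic in-flight instruction snapshot in a pipeline register (assumed layout). */
typedef struct {
    int  dest;       /* destination register number */
    bool reg_write;  /* does this instruction write a register? */
    bool is_load;    /* loads produce their data only at the end of MEM */
} inflight_t;

/* Flow (RAW) dependence: with full forwarding, only a load followed immediately
   by a dependent instruction still forces a one-cycle stall (bubble). */
static bool must_stall(int id_src1, int id_src2, inflight_t ex) {
    return ex.is_load && ex.reg_write &&
           (ex.dest == id_src1 || ex.dest == id_src2);
}

/* Forward from EX/MEM or MEM/WB to the ALU inputs when register numbers match. */
static const char *forward_src(int src, inflight_t ex_mem, inflight_t mem_wb) {
    if (ex_mem.reg_write && ex_mem.dest == src) return "EX/MEM";
    if (mem_wb.reg_write && mem_wb.dest == src) return "MEM/WB";
    return "register file";
}

int main(void) {
    inflight_t ex_mem = { .dest = 5, .reg_write = true, .is_load = true  };
    inflight_t mem_wb = { .dest = 7, .reg_write = true, .is_load = false };
    printf("stall? %d\n", must_stall(5, 2, ex_mem));            /* load-use hazard */
    printf("source r7 comes from %s\n", forward_src(7, ex_mem, mem_wb));
    return 0;
}
</code>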
===== Lecture 9 (2/4 Wed.) =====
* Definition of basic blocks * Control flow graph * Delayed branching * benefit? * What does it eliminate? * downside? * Delayed branching in SPARC (with squashing) * Backward compatibility with the delayed slot * What should be filled in the delayed slot * How to ensure correctness * Fine-grained multithreading * fetch from different threads * What are the issues (what if the program doesn't have many threads) * CDC 6600 * Denelcor HEP * No dependency checking * Inst. from different threads can fill in the bubbles * Cost? * Simultaneous multithreading * Branch prediction * Guess what to fetch next. * Misprediction penalty * Need to guess the direction and target * How to perform the performance analysis? * Given the branch prediction accuracy and penalty cost, how to compute the cost of a branch misprediction. * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute the cost coming from branch mispredictions. * How many extra instructions are being fetched? * What is the performance degradation? * How to reduce the miss penalty? * Predicting the next address (non PC+4 address) * Branch target buffer (BTB) * Predicting the address of the branch * Global branch history - for directions * Can use the compiler to profile and get more info * Input set dictates the accuracy * Adds time to compilation * Heuristics that are common and don't require profiling * Might be inaccurate * Do not require profiling * Static branch prediction * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch * For example, x86 has the hint bit * Dynamic branch prediction * Last time predictor * Two-bit counter based prediction * One more bit for hysteresis
===== Lecture 10 (2/6 Fri.) =====
* Branch prediction accuracy * Why is it so important? * Differences between 99% accuracy and 98% accuracy * Cost of a misprediction when the pipeline is very deep * Global branch correlation * Some branches are correlated with other branches * Local branch correlation * A branch can depend on its own past outcomes * Pattern history table * Record global taken/not taken results. * Cost vs. accuracy (What to record, do you record the PC? Just taken/not taken info?) * One-level branch predictor * What information is used * Two-level branch prediction * What entries do you keep in the global history? * What entries do you keep in the local history? * How many tables? * Cost when training a table * What are the purposes of each table? * Potential problems of a two-level history * GShare predictor (see the sketch at the end of this lecture's list) * Global history register is hashed with the PC * Store both GHR and PC information combined into one index * How do you use the information? Why is the XOR result still usable? * Warmup cost of the branch predictor * Hybrid solution? Fast warmup is used first, then switch to the slower one. * Tournament predictor (Alpha 21264) * Predicated execution - eliminates branches * What are the tradeoffs * What if the block is big (can lead to executing a lot of useless work) * Allows easier code optimization * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block * Reduces control dependences * Needs ISA support * Wish branches * Compiler generates both predicated and non-predicated code * HW decides which one to use * Use branch prediction on easy-to-predict code * Use predicated execution on hard-to-predict code * Compiler can be more aggressive in optimizing the code * What are the tradeoffs (slide# 47) * Multi-path execution * Execute both paths * Can lead to wasted work * VLIW * Superscalar
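A minimal gshare-style sketch for the two-level/GShare bullets above: the global history register is XORed with the PC to index a table of two-bit saturating counters (the extra bit provides hysteresis). The table size and hashing details are illustrative assumptions.
<code c>
#include <stdbool.h>
#include <stdint.h>

#define PHT_BITS 12
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];   /* 2-bit saturating counters, values 0..3 */
static uint32_t ghr;             /* global history register, 1 bit per recent branch */

/* Predict taken if the counter is in one of the two "taken" states (2 or 3). */
static bool predict(uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);  /* gshare: PC XOR global history */
    return pht[idx] >= 2;
}

/* Train with the actual outcome, then shift the outcome into the history. */
static void update(uint32_t pc, bool taken) {
    uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = (ghr << 1) | (taken ? 1u : 0u);
}
</code>
The XOR keeps the index usable because it folds both the branch identity (PC) and the recent global outcome pattern into the same table entry without storing them separately.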
===== Lecture 11 (2/11 Wed.) =====
* Geometric GHR length for branch prediction * Perceptron branch predictor * Multi-cycle executions (Different functional units take different numbers of cycles) * Instructions can retire out-of-order * How to deal with this case? Stall? Throw exceptions if there are problems? * Exceptions and Interrupts * When are they handled? * Why should some interrupts be handled right away? * Precise exceptions * arch. state should be consistent before handling the exception/interrupt * Easier to debug (you see the sequential flow when the interrupt occurs) * Deterministic * Easier to recover from the exception * Easier to restart the process * How to ensure precise exceptions? * Tradeoffs between each method * Reorder buffer * Reorder results before they are visible to the arch. state * Need to preserve the sequential semantics and data * What information is in a ROB entry * Where to get the value from (forwarding path? reorder buffer?) * Extra logic to check where the youngest instruction/value is * Content addressable search (CAM) * A lot of comparators * Different ways to simplify the reorder buffer * Register renaming * Same register refers to independent values (lack of registers) * Where does the exception happen (after retire) * History buffer * Update the register file when the instruction completes. Unroll if there is an exception. * Future file (commonly used, along with the reorder buffer) * Keep two sets of register files * An updated value (speculative), called the future file * A backup value (to restore the state quickly) * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to the ROB alone) * Branch misprediction resembles an exception * The difference is that a branch misprediction is not visible to the software * Also much more common (say, divide by zero vs. a mispredicted branch) * Recovery is similar to exception handling * Latency of the state recovery * What to do during the state recovery * Checkpointing * Advantages?
===== Lecture 12 (2/13 Fri.) =====
* Renaming * Register renaming table (see the sketch at the end of this lecture's list) * Predictor (branch predictor, cache line predictor ...) * Power budget (and its importance) * Architectural state, precise state * Memory dependence is known dynamically * Register state is not shared across threads/processors * Memory state is shared across threads/processors * How to maintain speculative memory states * Write buffers (help simplify the process of checking the reorder buffer) * Overall OoO mechanism * What are other ways of eliminating dispatch stalls * Dispatch when the sources are ready * Retired instructions make the source available * Register renaming * Reservation stations * What goes into the reservation station * Tags required in the reservation station * Tomasulo's algorithm * Without precise exceptions, OoO is hard to debug * Arch. register ID * Examples in the slides * Slide 28 --> register renaming * Slides 30-35 --> Exercise (also on the board) * This will be useful for the midterm * Register alias table * Broadcasting tags * Using dataflow
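A minimal sketch of the register renaming idea from Lecture 12: a register alias table maps architectural registers to fresh physical registers/tags so that anti and output dependences disappear. The table sizes and the free-register handling are simplified assumptions, not Tomasulo's exact bookkeeping.
<code c>
#include <stdio.h>

#define NUM_ARCH 8     /* architectural registers (kept small for the example) */
#define NUM_PHYS 32    /* physical registers / tags */

static int rat[NUM_ARCH];          /* register alias table: arch reg -> physical reg */
static int next_free = NUM_ARCH;   /* real HW maintains a proper free list */

/* Rename one instruction "rd = rs1 op rs2": sources read the current mapping,
   the destination gets a fresh physical register. */
static void rename(int rd, int rs1, int rs2) {
    int p1 = rat[rs1], p2 = rat[rs2];
    int pd = next_free++ % NUM_PHYS;
    rat[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void) {
    for (int i = 0; i < NUM_ARCH; i++) rat[i] = i;   /* initial identity mapping */
    /* r1 = r2 op r3 ; r1 = r4 op r5  -- the output dependence on r1 disappears */
    rename(1, 2, 3);
    rename(1, 4, 5);
    return 0;
}
</code>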
===== Lecture 13 (2/16 Mon.) =====
* OoO --> Restricted Dataflow * Extracting parallelism * What are the bottlenecks? * Issue width * Dispatch width * Parallelism in the program * What does it mean to be restricted dataflow * Still visible as a Von Neumann model * Where does the efficiency come from? * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense? * Load/store handling * Would like to schedule them out of order, but make them visible in-order * When do you schedule the load/store instructions? * Can we predict if loads/stores are dependent? * This is one of the most complex structures in load/store handling * What information can be used to predict these load/store optimizations? * Centralized vs. distributed? What are the tradeoffs? * How to handle a misprediction/recovery * OoO + branch prediction? * Speculatively update the history register * When do you update the GHR? * Token dataflow arch. * What are tokens? * How to match tokens * Tagged token dataflow arch. * What are the tradeoffs? * Difficulties?
===== Lecture 14 (2/18 Wed.) =====
* SISD/SIMD/MISD/MIMD * Array processor * Vector processor * Data parallelism * Where does the concurrency arise? * Differences between an array processor and a vector processor * VLIW * Compactness of an array processor * A vector processor operates on a vector of data (rather than a single datum (scalar)) * Vector length (also applies to array processors) * No dependency within a vector --> can have a deep pipeline * Highly parallel (both instruction level (ILP) and memory level (MLP)) * But the program needs to be very parallel * Memory can be the bottleneck (due to very high MLP) * What do the functional units look like? Deep pipeline and simpler control. * CRAY-1 is one example of a vector processor * Memory access pattern in a vector processor * How do the memory accesses benefit the memory bandwidth? * Memory level parallelism * Stride length vs. the number of banks * stride length should be relatively prime to the number of banks * Tradeoffs between row major and column major --> How can the vector processor deal with the two? * How to calculate the efficiency and performance of vector processors * What if there are multiple memory ports? * Gather/scatter allows a vector processor to be a lot more programmable (i.e. gather data for parallelism) * Helps handle sparse matrices * Conditional operation * Structure of vector units * How to automatically parallelize code through the compiler? * This is a hard problem. The compiler does not know the memory addresses. * What do we need to ensure for both vector and array processors? * Sequential bottleneck * Amdahl's law * Intel MMX --> An example of Intel's approach to SIMD * No VLEN, use the opcode to define the length * Stride is one in MMX * Intel SSE --> Modern version of MMX
===== Lecture 15 (2/20 Fri.) =====
* GPU * Warp/Wavefront * A bunch of threads sharing the same PC * SIMT * Lanes * FGMT + massively parallel * Tolerate long latency * Warp based SIMD vs. traditional SIMD * SPMD (Programming model) * Single program operates on multiple data * can have synchronization points * Many scientific applications are programmed in this manner * Control flow problem (branch divergence) * Masking (in a branch, mask threads that should not execute that path; see the sketch at the end of this lecture's list) * Lower SIMD efficiency * What if you have layers of branches? * Dynamic warp formation * Combining threads from different warps to increase SIMD utilization * This can cause memory divergence * VLIW * Wide fetch * IA-64 * Tradeoffs * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) * A lot of the burden is placed on the compiler * Decoupled access/execute * Limited form of OoO * Tradeoffs * How to steer the instructions (determine dependency/stalling)? * Instruction scheduling techniques (static vs. dynamic) * Systolic arrays * Processing elements transform data in chains * Developed for image processing (for example, convolution) * Stage processing
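A small sketch of the masking bullet above: all lanes of a warp share one PC, so an if/else is handled by running both paths with complementary active masks. The warp width and the example condition are arbitrary assumptions.
<code c>
#include <stdint.h>
#include <stdio.h>

#define WARP 8  /* assumed warp width for the example */

int main(void) {
    int x[WARP] = {3, -1, 4, -5, 9, -2, 6, -7};
    int y[WARP] = {0};

    /* Build the active mask for the taken path of "if (x[i] < 0)". */
    uint8_t mask = 0;
    for (int lane = 0; lane < WARP; lane++)
        if (x[lane] < 0) mask |= 1u << lane;

    /* "Then" path: only lanes with their mask bit set execute. */
    for (int lane = 0; lane < WARP; lane++)
        if (mask & (1u << lane)) y[lane] = -x[lane];

    /* "Else" path: the complementary lanes execute. SIMD efficiency drops
       because each pass occupies the whole warp but only some lanes do work. */
    for (int lane = 0; lane < WARP; lane++)
        if (!(mask & (1u << lane))) y[lane] = x[lane];

    for (int lane = 0; lane < WARP; lane++) printf("%d ", y[lane]);
    printf("\n");
    return 0;
}
</code>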
===== Lecture 16 (2/23 Mon.) =====
* Systolic arrays * Processing elements transform data in chains * Can be arrays of multi-dimensional processing elements * Developed for image processing (for example, convolution) * Can be used to break stages in pipelined programs, using a set of queues and processing elements * Can enable high concurrency and is good for regular programs * Very special purpose * The WARP computer * Static instruction scheduling * How do we find the next instruction to execute? * Live-ins and live-outs * Basic blocks * Rearranging instructions in the basic block * Code movement from one basic block to another * Straight line code * Independent instructions * How to identify independent instructions * Atomicity * Trace scheduling * Side entrances * Fix-up code * How scheduling is done * Instruction scheduling * Prioritization heuristics * Superblock * Traces with no side entrances * Hyperblock * BS-ISA * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA
===== Lecture 17 (2/25 Wed.) =====
* IA-64 * EPIC * IA-64 instruction bundle * Multiple instructions in the bundle along with the template bits * Template bits * Stop bits * Non-faulting loads and exception propagation * Aggressive ST-LD reordering * Physical memory system * Ideal pipelines * Ideal cache * More capacity * Fast * Cheap * High bandwidth * DRAM cell * Cheap * Sense the perturbation through the sense amplifier * Slow and leaky * SRAM cell (cross-coupled inverters) * Expensive * Fast (easier to sense the value in the cell) * Memory bank * Read access sequence * DRAM: Activate -> Read -> Precharge (if needed) * What dominates the access latency for DRAM and SRAM * Scaling issue * Hard to scale the cell to smaller sizes * Memory hierarchy * Prefetching * Caching * Spatial and temporal locality * Caches can exploit these * Recently used data is likely to be accessed * Nearby data is likely to be accessed * Caching in a pipelined design * Cache management * Manual * Data movement is managed manually * Embedded processors * GPU scratchpad * Automatic * HW manages data movement * Latency analysis * Based on the hit and miss status, next level access time (if miss), and the current level access time * Cache basics * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc. * Cache access * How to access tag and data (in parallel vs. serially) * How do tag and index get used? (see the sketch at the end of this lecture's list) * Modern processors perform serial access for higher-level caches (L3 for example) to save power * Cost and benefit of having more associativity * Given the associativity, which block should be replaced if the set is full * Replacement policies * Random * Least recently used (LRU) * Least frequently used * Least costly to refetch * etc. * How to implement LRU * How to keep track of access ordering * Complexity increases rapidly * Approximate LRU * Victim and next victim policy
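To make the tag/index question above concrete, here is a sketch of splitting an address for a direct-mapped cache. The cache geometry (64-byte blocks, 256 sets) is an arbitrary assumption.
<code c>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64          /* assumed block (line) size  -> 6 offset bits */
#define NUM_SETS    256         /* assumed number of sets     -> 8 index bits  */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr % BLOCK_BYTES;                 /* byte within the block      */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;    /* selects the set            */
    uint32_t tag    = addr / BLOCK_BYTES / NUM_SETS;      /* compared against tag store */

    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}
</code>
With higher associativity the index selects a set of several blocks, and the tag is compared against every way in that set, which is where the extra cost of associativity comes from.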
===== Lecture 18 (2/27 Fri.) =====
* Tag store and data store * Cache hit rate * Average memory access time (AMAT) * AMAT vs. stall time * Cache basics * Direct mapped vs. associative cache * Set/block (line)/Placement/replacement * How do tag and index get used? * Full associativity * Set associative cache * insertion, promotion, eviction (replacement) * Various replacement policies * How to implement LRU * How to keep track of access ordering * Complexity increases rapidly * Approximate LRU * Victim and next victim policy * Set thrashing * Working set is bigger than the associativity * Belady's OPT * Is this optimal? * Complexity? * DRAM as a cache for disk * Handling writes * Write through * Simpler, all levels have up-to-date data * Write back * Needs a modified (dirty) bit to make sure accesses to the data get the updated data * Sectored caches * Use subblocks * Lower bandwidth requirement * More complex * Instruction vs. data caches * Where to place instructions * Unified vs. separated * In the first level cache * Cache access * First level access * Second level access * When to start the second level access * Cache performance * capacity * block size * associativity * Classification of cache misses
===== Lecture 19 (03/02 Mon.) =====
* Subblocks * Victim cache * Small, but fully assoc. cache behind the actual cache * Holds recently evicted (victim) cache blocks * Prevents ping-ponging * Pseudo associativity * Simpler way to implement an associative cache * Skewed assoc. cache * Different hashing functions for each way * Restructure data access patterns * Order of loop traversal * Blocking * Memory level parallelism * A parallel cache miss is less costly per miss than serial misses * MSHR * Keeps track of pending cache misses * Think of this as something like a load/store buffer for the cache * What information goes into the MSHR? * When do you access the MSHR? * Memory banks * Shared caches in multi-core processors
===== Lecture 20 (03/04 Wed.) =====
* Virtual vs. physical memory * System's management of memory * Benefits * Problem: physical memory has limited size * Mechanisms: indirection, virtual addresses, and translation * Demand paging * Physical memory as a cache * Tasks of system SW for VM * Serving a page fault * Address translation * Page table (see the sketch at the end of this lecture's list) * PTE (page table entry) * Page replacement algorithms * CLOCK algo. * Inverted page table * Page size trade-offs * Protection * Multi-level page tables * x86 implementation of the page table * TLB * Handling misses * When to do address translation? * Homonyms and Synonyms * Homonym: the same VA maps to different PAs in different processes * Synonyms: multiple VAs map to the same PA * Shared libraries, shared data, copy-on-write * Virtually indexed vs. physically indexed * Virtually tagged vs. physically tagged * Virtually indexed physically tagged * Can these create problems when we have caches? * How to eliminate these problems? * Page coloring * Interaction between the cache and the TLB
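A minimal sketch of the page-table bullets from Lecture 20: a virtual address is split into a virtual page number and an offset, and a page table maps VPN to PPN. The page size, table size, single-level flat table, and PTE layout are simplified assumptions (real x86 tables are multi-level, as the list notes).
<code c>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u         /* assumed 4 KB pages -> 12 offset bits */
#define NUM_VPAGES 1024u         /* toy, single-level page table */

typedef struct { uint32_t ppn; int valid; } pte_t;
static pte_t page_table[NUM_VPAGES];

/* Translate a virtual address; returns 0 and sets *pa on success,
   returns -1 on a page fault (invalid PTE). */
static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va / PAGE_SIZE;
    uint32_t offset = va % PAGE_SIZE;
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid) return -1;  /* page fault */
    *pa = page_table[vpn].ppn * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    page_table[3] = (pte_t){ .ppn = 42, .valid = 1 };
    uint32_t pa;
    if (translate(3 * PAGE_SIZE + 0x10, &pa) == 0)
        printf("PA = 0x%x\n", pa);
    return 0;
}
</code>
A TLB caches exactly these VPN-to-PPN translations; virtually indexed, physically tagged caches overlap the cache index lookup with this translation, which is the interaction the list refers to.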
===== Lecture 21 (03/23 Mon.) =====
* DRAM scaling problem * Demands/trends affecting the main memory * More capacity * Low energy * More bandwidth * QoS * ECC in DRAM * Multi-porting * Virtual multi-porting * Time-share the port, not too scalable but cheap * True multiporting * Multiple cache copies * Alignment * Banking * Can have bank conflicts * Extra interconnect across banks * Address mapping can mitigate bank conflicts * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity) * Bank mapping * How to avoid bank conflicts? * Channel mapping * Address mapping to minimize bank conflicts * Page coloring * Virtual to physical mapping that can help reduce conflicts * Accessing DRAM * Row bits * Column bits * Addressability * DRAM has its own clock * Sense amplifier * Bit lines * Word lines * DRAM (1T) vs. SRAM (6T) * Cost * Latency * Interleaving in DRAM * Effects of address mapping on memory interleaving * Effects of the program's memory access patterns on interleaving * DRAM Bank * To minimize the cost of interleaving (shares the data bus and the command bus) * DRAM Rank * Minimizes the cost per chip (a bundle of chips operated together) * DRAM Channel * An interface to DRAM, each with its own ranks/banks * DRAM Chip * DIMM * More DIMMs add interconnect complexity * List of commands to read/write data in DRAM * Activate -> read/write -> precharge * Activate moves data into the row buffer * Precharge prepares the bank for the next access * Row buffer hit * Row buffer conflict * Scheduling memory requests to lower row conflicts * Burst mode of DRAM * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits * Address mapping * Row interleaved * Cache block interleaved * Memory controller * Sending DRAM commands * Periodically send commands to refresh DRAM cells * Ensure correctness and data integrity * Where to place the memory controller * On the CPU chip vs. at the main memory * Higher BW on-chip * Determines the order of requests that will be serviced in DRAM * Request queues that hold requests * Send requests whenever the request can be sent to the bank * Determine which command (across banks) should be sent to DRAM
===== Lecture 22 (03/25 Wed.) =====
* Flash controller * Flash memory * Garbage collection in flash * Overhead in flash memory * Erase (off the critical path, but takes a long time) * Different types of DRAM * DRAM design choices * Cost/density/latency/BW/Yield * Sense Amplifier * How do they work * Double data rate * Subarray * RowClone * Moving bulk data from one row to another * Lower latency and bandwidth consumption when copying/zeroing out data * TL-DRAM * Far segment * Near segment * What causes the long latency * Benefit of TL-DRAM * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) * List of commands to read/write data in DRAM * Activate -> read/write -> precharge * Activate moves data into the row buffer * Precharge prepares the bank for the next access * Row buffer hit * Row buffer conflict * Scheduling memory requests to lower row conflicts * Burst mode of DRAM * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits * Address mapping (see the sketch at the end of this lecture's list) * Row interleaved * Cache block interleaved * Memory controller * Sending DRAM commands * Periodically send commands to refresh DRAM cells * Ensure correctness and data integrity * Where to place the memory controller * On the CPU chip vs. at the main memory * Higher BW on-chip * Determines the order of requests that will be serviced in DRAM * Request queues that hold requests * Send requests whenever the request can be sent to the bank * Determine which command (across banks) should be sent to DRAM * Priority of demand vs. prefetch requests * Memory scheduling policies * FCFS * FR-FCFS * Try to maximize the row buffer hit rate * Capped FR-FCFS: FR-FCFS with a timeout * Usually this is done at the command level (read/write commands and precharge/activate commands) * PAR-BS * Key benefits * stall time * shortest job first * STFM * ATLAS * TCM * Key benefits * Configurability * Fairness + performance at the same time * Robustness issues * Open row policy * Closed row policy * QoS * QoS issues in memory scheduling * Fairness * Performance guarantees
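A sketch of the two address-mapping options mentioned above (row interleaved vs. cache-block interleaved), showing which bank a physical address falls into under each scheme. The bank count, row size, and block size are arbitrary assumptions, as is the simple modulo mapping.
<code c>
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS   8u
#define ROW_BYTES   8192u   /* assumed row (row-buffer) size  */
#define BLOCK_BYTES 64u     /* assumed cache block size       */

/* Row interleaved: consecutive rows go to consecutive banks.
   A stream within one row keeps hitting the same open row buffer. */
static unsigned bank_row_interleaved(uint64_t pa) {
    return (pa / ROW_BYTES) % NUM_BANKS;
}

/* Cache-block interleaved: consecutive cache blocks go to consecutive banks.
   A stream is spread across banks (more bank-level parallelism). */
static unsigned bank_block_interleaved(uint64_t pa) {
    return (pa / BLOCK_BYTES) % NUM_BANKS;
}

int main(void) {
    for (uint64_t pa = 0; pa < 4 * BLOCK_BYTES; pa += BLOCK_BYTES)
        printf("pa=%4llu  row-interleaved bank=%u  block-interleaved bank=%u\n",
               (unsigned long long)pa,
               bank_row_interleaved(pa), bank_block_interleaved(pa));
    return 0;
}
</code>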
===== Lecture 23 (03/27 Fri.) =====
* Different ways to control interference in DRAM * Partitioning of resources * Channel partitioning: map applications that interfere with each other to different channels * Keep track of applications' characteristics * Dedicating a channel might waste bandwidth * Need OS support to determine the channel bits * Source throttling * The controller throttles cores depending on the performance target * Example: Fairness via source throttling * Detect unfairness and throttle the application that is interfering * How do you estimate slowdown? * Threshold based solution: hard to configure * App/thread scheduling * Critical threads usually stall the progress * Designing a DRAM controller * Has to handle the normal DRAM operations * Read/write/refresh/all the timing constraints * Keep track of resources * Assign priorities to different requests * Manage requests to banks * Self-optimizing controller * Use machine learning to improve the DRAM controller * A-DRM * Architecture-aware DRAM * Multithreading * synchronization * Pipelined programs * Producer-consumer model * Critical path * Limiter threads * Prioritization between threads * Different power modes in DRAM * DRAM Refresh * Why does DRAM have to be refreshed every 64 ms? (see the sketch at the end of this lecture's list) * Banks are unavailable during refresh * LPDDR mitigates this by using per-bank refresh * Refresh takes longer with bigger DRAM * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner * As opposed to burst refresh (long pause time) * RAIDR: Reduce DRAM refresh by profiling and binning * Some rows do not have to be refreshed very frequently * Profile the rows * High temperature changes the retention time: need online profiling * Bloom filter * Represents set membership * Approximate * Can contain false positives * Better/more hash functions help reduce this
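A back-of-the-envelope sketch for the refresh bullets above: the fraction of time a device is unavailable is roughly (refresh operations per 64 ms window x busy time per refresh operation) / 64 ms. The operation counts and latencies below are made-up example values, used only to show how the overhead grows with capacity.
<code c>
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only -- not taken from the lecture slides. */
    double window_us        = 64000.0;  /* every row must be refreshed within 64 ms   */
    double refresh_ops      = 8192.0;   /* assumed refresh commands issued per window */
    double t_refresh_small  = 0.35;     /* assumed busy time per command (us), small device */
    double t_refresh_big    = 0.9;      /* assumed busy time per command (us), denser device */

    printf("small device: %.1f%% of time spent refreshing\n",
           100.0 * refresh_ops * t_refresh_small / window_us);
    printf("denser device: %.1f%% of time spent refreshing\n",
           100.0 * refresh_ops * t_refresh_big / window_us);
    return 0;
}
</code>
This is the overhead RAIDR attacks by refreshing the few weak (short-retention) rows often and everything else less often.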
===== Lecture 24 (03/30 Mon.) =====
* Simulation * Drawbacks of RTL simulation * Time consuming * Complex to develop * Hard to perform design exploration * Explore the design space quickly * Match the behavior of existing systems * Tradeoffs: speed, accuracy, flexibility * High-level simulation vs. detailed simulation * High-level simulation is faster, but lower accuracy * Controllers that work with multiple types of cores * Design problem: how to find a good scheduling policy on its own? * Self-optimizing memory controller: using machine learning * Can adapt to the applications * The complexity is very high * Tolerating latency can be costly * The instruction window is complex * The benefit also diminishes * Designing the buffers can be complex * A simpler way to tolerate out-of-order is desirable * Different sources that cause the core to stall in OoO * Cache misses * Note that a stall happens if the inst. window is full * Scaling the instruction window size is hard * It is better (less complex) to make the window more efficient * Runahead execution * Try to obtain MLP w/o increasing the instruction window * Runahead (i.e. execute ahead) when there is a long memory instruction * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time * Execute future instructions to generate accurate prefetches * Allow future data to be in the cache * How to support runahead execution? * Need a way to checkpoint the state when entering runahead mode * How to make executing down the wrong path useful? * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
===== Lecture 25 (4/1 Wed.) =====
* More on runahead execution * How to support runahead execution? * Need a way to checkpoint the state when entering runahead mode * How to make executing down the wrong path useful? * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative) * Cost and benefit of runahead execution (slide number 27) * Runahead can have inefficiencies * Runahead periods that are useless * Get rid of useless/inefficient periods * What if there is a dependent cache miss? * Cannot be parallelized in vanilla runahead * Can predict the value of the dependent load * How to predict the address of the load * Delta value information * Stride predictor * AVD prediction * Questions regarding prefetching * What to prefetch * When to prefetch * how do we prefetch * where to prefetch from * Prefetching can cause thrashing (evict a useful block) * Prefetching can also be useless (not being used) * Needs to be efficient * Can cause memory bandwidth problems in GPUs * Prefetch the whole block, more than one block, or a subblock? * Each one of them has pros and cons * A big prefetch is more likely to waste bandwidth * Commonly done at a cache block granularity * Prefetch accuracy: fraction of useful prefetches out of all prefetches * Prefetchers usually predict based on * Past knowledge * Compiler hints * The prefetcher has to prefetch at the right time * A prefetch that is too early might get evicted * It might also evict other useful data * A prefetch that is too late does not hide the whole memory latency * Previous prefetches at the same PC can be used as the history * Previous demand requests are also good information to use for prefetches * Prefetch buffer * Place the prefetched data there to avoid thrashing * Can treat demand/prefetch requests separately * More complex * Generally, a demand block is more important * This means eviction should prefer prefetch blocks as opposed to demand blocks * Tradeoffs between where we place the prefetcher * Look at L1 hits and misses * Look at L1 misses only * Look at L2 misses * Different access patterns affect accuracy * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses) * Software vs. hardware vs. execution based prefetching * Software: the ISA provides prefetch instructions, software utilizes them * What information is useful * How to make sure the prefetch is timely * What if you have a pointer based structure * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one timely enough) * Can be solved by hinting the next-next and/or next-next-next address * Hardware: Identify the pattern and prefetch * Execution driven: Opportunistically try to prefetch (runahead, dual-core execution) * Stride prefetcher (see the sketch at the end of this lecture's list) * Predicts strides, which are common in many programs * Cache block based or instruction based * Stream buffer design * Buffer the stream of accesses (next addresses) * Use the information to prefetch * What affects prefetcher performance * Prefetch distance * How far ahead should we prefetch * Prefetch degree * How many prefetches do we issue * Prefetcher performance * Coverage * Out of all the demand misses, how many were covered by prefetches * Accuracy * Out of all the prefetch requests, how many are actually used * Timeliness * How much memory latency can we hide with the prefetch requests * Cache pollution * How many demand misses did the prefetcher cause (by evicting useful data)? * Hard to quantify
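A minimal sketch of the stride prefetcher bullet above: a small table indexed by load PC remembers the last address and last stride, and issues a prefetch when the same stride repeats. The table size, the prefetch distance of one, and the prefetch hook are simplified assumptions.
<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64  /* assumed number of tracked load PCs */

typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
} stride_entry_t;

static stride_entry_t table[TABLE_SIZE];

static void issue_prefetch(uint64_t addr) {        /* stand-in for a real prefetch request */
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every (load PC, accessed address) pair. */
static void stride_prefetcher(uint64_t pc, uint64_t addr) {
    stride_entry_t *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->last_stride)
        issue_prefetch(addr + (uint64_t)stride);   /* prefetch distance of 1 for simplicity */
    e->last_stride = stride;
    e->last_addr   = addr;
}

int main(void) {
    for (uint64_t a = 0x1000; a < 0x1000 + 8 * 64; a += 64)
        stride_prefetcher(0x400100, a);            /* a streaming load with a 64-byte stride */
    return 0;
}
</code>
Larger prefetch distance and degree would hide more latency at the cost of accuracy and bandwidth, which is exactly the distance/degree tradeoff in the list.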
===== Lecture 26 (4/3 Fri.) =====
* Feedback directed prefetcher * Use the results of the prefetcher as feedback to the prefetcher * With accuracy, timeliness, and pollution information * Markov prefetcher (see the sketch at the end of this lecture's list) * Prefetch based on previous history * Use a Markov model to predict * Pros: can cover arbitrary patterns (easy for linked list traversals or trees) * Downside: high cost, cannot help with compulsory misses (no history) * Content directed prefetching * Identify pointers in the contents of memory (used as the address to prefetch) * Not very efficient (hard to figure out which part of the block is a pointer) * Software can give hints * Correlation table * Address correlation * Execution based prefetchers * Helper thread/speculative thread * Use another thread to pre-execute a program * Can be software based or hardware based * Discover misses before the main program (to prefetch data in a timely manner) * How do you construct the helper thread? * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9 * Thread-based pre-execution * Error tolerance * Solutions to errors * Tolerate errors * New interfaces, new designs * Eliminate or minimize errors * New technology, system-wide rethinking * Embrace errors * Map data that can tolerate errors to error-prone areas * Hybrid memory systems * Combining multiple memory technologies together * What can emerging technologies help with? * Scalability * Lower cost * Energy efficiency * Possible solutions to the scaling problem * Lower-leakage DRAM * Heterogeneous DRAM (TL-DRAM, etc.) * Add more functionality to DRAM * Denser designs (3D stacking) * Different technologies * NVM * Charge vs. resistive memory * How is data written? * How is data read? * Non-volatile memory * Resistive memory * PCM * Inject current to change the phase * Scales better than DRAM * Multiple bits per cell * Wider resistance range * No refresh is needed * Downsides: latency and write endurance * STT-MRAM * Inject current to change the polarity * Memristor * Inject current to change the structure * Pros and cons between the different technologies * Persistence - data stays there even without power * Unified memory and storage management (persistent data structures) - Single level store * Improve energy and performance * Simplify the programming model * Different design options for DRAM + NVM * DRAM as a cache * Place some data in DRAM and other data in PCM * Based on the characteristics * Frequently accessed data that needs lower write latency goes in DRAM
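A small sketch of the Markov (address-correlation) prefetcher above: a correlation table remembers which miss address tended to follow each miss address and prefetches that successor the next time around. Keeping a single successor per entry in a direct-mapped table is a simplifying assumption; real proposals keep several successors with probabilities.
<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256  /* assumed correlation-table size */

typedef struct { uint64_t miss_addr; uint64_t next_addr; int valid; } corr_entry_t;
static corr_entry_t table[TABLE_SIZE];
static uint64_t prev_miss;
static int have_prev;

static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every cache miss address. */
static void markov_prefetcher(uint64_t miss) {
    /* 1. Record that "prev_miss was followed by miss". */
    if (have_prev) {
        corr_entry_t *rec = &table[(prev_miss >> 6) % TABLE_SIZE];
        rec->miss_addr = prev_miss;
        rec->next_addr = miss;
        rec->valid = 1;
    }
    /* 2. If this miss has been seen before, prefetch its recorded successor. */
    corr_entry_t *e = &table[(miss >> 6) % TABLE_SIZE];
    if (e->valid && e->miss_addr == miss) issue_prefetch(e->next_addr);

    prev_miss = miss;
    have_prev = 1;
}

int main(void) {
    /* A repeating pointer-chasing-like miss sequence: A, B, C, A, B, C ... */
    uint64_t seq[] = {0x1000, 0x8040, 0x3f80, 0x1000, 0x8040, 0x3f80};
    for (int i = 0; i < 6; i++) markov_prefetcher(seq[i]);
    return 0;
}
</code>
This covers arbitrary repeating patterns (unlike a stride prefetcher), but the table is expensive and it can never help a compulsory miss, matching the pros/cons in the list.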
===== Lecture 27 (4/6 Mon.) =====
* Flynn's taxonomy * Parallelism * Reduces power consumption (P ~ CV^2F) * Better cost efficiency and easier to scale * Improves dependability (in case another core is faulty) * Different types of parallelism * Instruction level parallelism * Data level parallelism * Task level parallelism * Task level parallelism * Partition a single, potentially big, task into multiple parallel sub-tasks * Can be done explicitly (parallel programming by the programmer) * Or implicitly (hardware partitions a single thread speculatively) * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement) * Loosely coupled multiprocessors * No shared global address space * Message passing to communicate between the different processors * Simple to manage memory * Tightly coupled multiprocessors * Shared global address space * Need to ensure consistency of data * Programming issues * Hardware-based multithreading * Coarse grained * Fine grained * Simultaneous: dispatch instructions from multiple threads at the same time * Parallel speedup * Superlinear speedup * Utilization, Redundancy, Efficiency * Amdahl's law * Maximum speedup * Parallel portion is not perfect * Serial bottleneck * Synchronization cost * Load imbalance * Some threads have more work and require more time to hit the sync. point * Critical sections * Enforce mutually exclusive access to shared data * Issues in parallel programming * Correctness * Synchronization * Consistency
===== Lecture 28 (4/8 Wed.) =====
* Ordering of instructions * Maintaining memory consistency when there are multiple threads and shared memory * Need to ensure the semantics are not changed * Making sure the shared data is properly locked when used * Support mutual exclusion * Ordering depends on when each processor executes * Debugging is also difficult (non-deterministic behavior) * Dekker's algorithm (see the sketch at the end of this lecture's list) * Inconsistency -- the two processors did NOT see the same order of operations to memory * Sequential consistency * Multiple correct global orders * Two issues: * Too conservative/strict * Performance limiting * Weak consistency: global ordering only at synchronization points * The programmer hints where the synchronizations are * Memory fence * More burden on the programmer * Cache coherence * Can be done at the software level or the hardware level * Snoop-based coherence * A simple protocol with two states by broadcasting reads/writes on a bus * Maintaining coherence * Needs to provide 1) write propagation and 2) write serialization * Update vs. Invalidate * Two cache coherence methods * Snoopy bus * Bus based, single point of serialization * More efficient with a small number of processors * Processors snoop other caches' read/write requests to keep the cache block coherent * Directory * Single point of serialization per block * The directory coordinates the coherence * More scalable * The directory keeps track of where the copies of each block reside * Supplies data on a read * Invalidates the block on a write * Has an exclusive state
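A sketch of Dekker's algorithm referenced above, for two threads. It is only correct under sequential consistency; on real hardware with weaker consistency the flag/turn accesses need fences, which is exactly the point the lecture makes. The sketch assumes a toolchain with C11 threads and atomics, and uses sequentially consistent atomics to stand in for those fences; the counter example is purely illustrative.
<code c>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static atomic_bool want[2];
static atomic_int  turn;
static long counter;                 /* shared data protected by the critical section */

static void lock(int me) {
    int other = 1 - me;
    atomic_store(&want[me], true);
    while (atomic_load(&want[other])) {        /* the other thread also wants in   */
        if (atomic_load(&turn) != me) {        /* not our turn: back off and wait  */
            atomic_store(&want[me], false);
            while (atomic_load(&turn) != me) { /* spin */ }
            atomic_store(&want[me], true);
        }
    }
}

static void unlock(int me) {
    atomic_store(&turn, 1 - me);
    atomic_store(&want[me], false);
}

static int worker(void *arg) {
    int me = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        lock(me);
        counter++;                   /* mutually exclusive under sequential consistency */
        unlock(me);
    }
    return 0;
}

int main(void) {
    thrd_t t0, t1;
    int id0 = 0, id1 = 1;
    thrd_create(&t0, worker, &id0);
    thrd_create(&t1, worker, &id1);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    printf("counter = %ld (expect 200000)\n", counter);
    return 0;
}
</code>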
===== Lecture 29 (4/10 Fri.) =====
* MSI coherence protocol * The problem: unnecessary broadcasts of invalidations * MESI coherence protocol * Adds the exclusive state (this is the only cache copy and it is clean) to MSI * Multiple invalidation tradeoffs * Problem: memory can be unnecessarily updated * A possible owner state (MOESI) * Tradeoffs between snooping and directory based coherence protocols * Slide 31 has a good summary * Directory: data structures * Bit vectors vs. linked lists * Scalability of directories * Size? Latency? Thousands of nodes? Best of both snooping and directory?
===== Lecture 30 (4/13 Mon.) =====
* In-memory computing * Design goals of DRAM * DRAM structures * Banks * Capacitors and sense amplifiers * Trade-offs b/w the number of sense amps and cells * Width of bank I/O vs. row size * DRAM operations * ACTIVATE, READ/WRITE, and PRECHARGE * Trade-offs * Latency * Bandwidth: Chip vs. rank vs. bank * What's the benefit of having 8 chips? * Parallelism * RowClone * What are the problems? * Copying b/w two rows that share the same sense amplifiers * System software support * Bitwise AND/OR
===== Lecture 31 (4/15 Wed.) =====
* Application slowdown * Interference between different applications * An application's performance depends on the other applications it is running with * Predictable performance * Why is it important? * Applications that need predictability * How to predict the performance? * What information is useful? * What needs to be guaranteed? * How to estimate the performance when running with others? * Easy, just measure the performance while it is running. * How to estimate the performance when the application is running by itself? * Hard if there is no profiling. * The relationship between memory service rate and performance. * Key assumption: applications are memory bound * Behavior of memory-bound applications * With and without interference * Memory phase vs. compute phase * MISE * Estimating slowdown using request service rates * Inaccuracy when measuring the request service rate alone * Non-memory-bound applications * Control slowdown and provide soft guarantees * Taking the shared cache into account * MISE model + cache resource management * Auxiliary tag store * Separate tag stores for different cores * Cache access rate alone and shared as the metrics to estimate slowdown * Cache partitioning * How to determine the partitioning * Utility based cache partitioning * Others * Maximum slowdown and fairness metrics
===== Lecture 32 (4/20 Mon.) =====
* Heterogeneous systems * Asymmetric cores: different types of cores on the chip * Each of these cores is optimized for different workloads/requirements/goals * Multiple special purpose processors * Flexible and can adapt to workload behavior * Disadvantages: complex and high overhead * Examples: CPU-GPU systems, heterogeneity in execution models * Heterogeneous resources * Example: reliable and non-reliable DRAM in the same system * Key problems in modern systems * Memory system * Efficiency * Predictability * Asymmetric design can help solve these problems * Serialized code sections * Bottleneck in multicore execution * Parallelizable vs. serial portion * Accelerate critical sections * Cache ping-ponging * Synchronization latency * Symmetric vs. asymmetric design * Large cores + small cores * Core asymmetry * Amdahl's law with heterogeneous cores (see the sketch at the end of this lecture's list) * Parallel bottlenecks * Resource contention * Depends on what is running * Accelerated critical sections * Ship critical sections to large cores * Small modifications and low overhead * False serialization might become the bottleneck * Can reduce parallel throughput * Effect on private cache misses and shared cache misses
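For the "Amdahl's law with heterogeneous cores" bullet above, a small worked sketch in the spirit of the commonly cited symmetric-vs-asymmetric speedup model (Hill and Marty, "Amdahl's Law in the Multicore Era"). The chip budget in base-core equivalents, the perf(r) = sqrt(r) assumption for a big core built from r units, and the parallel fraction are all illustrative assumptions.
<code c>
#include <math.h>
#include <stdio.h>

/* Assumed performance of one big core built from r base-core resources. */
static double perf(double r) { return sqrt(r); }

/* Symmetric design: n/r identical cores, each of size r. */
static double speedup_symmetric(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

/* Asymmetric design: one big core of size r plus (n - r) base cores.
   The serial part runs on the big core; the parallel part uses everything. */
static double speedup_asymmetric(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
}

int main(void) {
    double f = 0.9;    /* assumed parallel fraction            */
    double n = 64.0;   /* assumed budget in base-core equivalents */
    for (double r = 1; r <= 16; r *= 2)
        printf("r=%2.0f  symmetric=%6.2f  asymmetric=%6.2f\n",
               r, speedup_symmetric(f, n, r), speedup_asymmetric(f, n, r));
    return 0;
}
</code>
The asymmetric design wins because the one big core attacks the serial bottleneck without giving up most of the parallel throughput, which is the motivation behind accelerated critical sections.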
===== Lecture 33 (4/27 Mon.) =====
* Interconnects * Connecting multiple components together * Goals: Scalability, flexibility, performance, and energy efficiency * Metrics: Performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency * Saturation point * Saturation throughput * Topology * How to wire components together; affects routing, throughput, latency * Bus: All nodes connected to a single shared bus * Hard to increase frequency and bandwidth, poor scalability, but simple * Point-to-point * Low contention and potentially low latency. Costly, not scalable, and hard to wire. * Crossbar * No contention. Requests between different src/dest pairs can be serviced concurrently. Costly. * Multistage logarithmic network * Indirect network, low contention, multiple requests can be serviced concurrently. More scalable compared to a crossbar. * Circuit switching * Omega network, delta network. * Butterfly network * Intermediate switches between sources and destinations * Switching vs. topology * Ring * Each node connected to two other nodes, forming a ring * Low overhead, high latency, not as scalable. * Unidirectional ring and bi-directional ring * Hierarchical Rings * Layers of rings. More scalable, lower latency. * Bridge routers connect multiple rings together * Mesh * 4 input and output ports * More bisection bandwidth and more scalable * Easy to lay out * Path diversity * Routers are more complex * Tree * Another hierarchical topology * Specialized topology * Good for local traffic * Fat tree: higher levels have more bandwidth * CM-5 Fat tree * Fat tree with 4x2 switches * Hypercube * N-dimensional cubes * Caltech Cosmic Cube * Very complex * Routing algorithms * How does a message get sent from source to destination * Static or adaptive * Handling contention * Buffering helps handle contention, but adds complexity * Three types of routing algorithms * Deterministic: always takes the same path (see the sketch at the end of this list) * Oblivious: takes different paths without taking the state of the network into account * For example, Valiant's algorithm * Adaptive: takes different paths, taking the state of the network into account * Non-minimal adaptive routing vs. minimal adaptive routing * Minimal path: a path that has the minimum number of hops * Buffering and flow control * How to store packets within the network * Handling oversubscription * Source throttling * Bufferless vs. buffered crossbars * Buffer overflow * Bufferless deflection routing * Deflect packets when there is contention * Hot-potato routing
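A sketch of deterministic dimension-order (XY) routing on a 2D mesh, as an example of the "always takes the same path" routing class above: fully correct the X coordinate first, then Y. The mesh size and coordinate encoding are arbitrary assumptions.
<code c>
#include <stdio.h>

typedef struct { int x, y; } node_t;

/* Next hop from cur toward dst under XY (dimension-order) routing:
   correct X first, then Y. Deterministic, and deadlock-free on a mesh
   because a packet never turns from the Y dimension back into X. */
static node_t xy_next_hop(node_t cur, node_t dst) {
    if (cur.x < dst.x) cur.x++;
    else if (cur.x > dst.x) cur.x--;
    else if (cur.y < dst.y) cur.y++;
    else if (cur.y > dst.y) cur.y--;
    return cur;   /* unchanged if cur == dst */
}

int main(void) {
    node_t cur = {0, 3}, dst = {2, 1};
    while (cur.x != dst.x || cur.y != dst.y) {
        cur = xy_next_hop(cur, dst);
        printf("-> (%d,%d)\n", cur.x, cur.y);
    }
    return 0;
}
</code>
An oblivious scheme like Valiant's instead routes through a random intermediate node, and an adaptive scheme would pick among minimal next hops based on congestion.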