Buzzwords

Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.

Lecture 1 (1/13 Mon.)

  • Level of transformation
    • Algorithm
    • System software
    • Compiler
  • Cross abstraction layers
    • Expose an interface
  • Tradeoffs
  • Caches
  • Multi-thread
  • Multi-core
  • Unfairness
  • DRAM controller/Memory controller
  • Memory hog
  • Row buffer hit/miss
  • Row buffer locality
  • Streaming access/ Random access
  • DRAM refresh
  • Retention time
  • Profiling DRAM retention time
  • Power consumption
  • Wimpy cores
  • Bloom filter
    • Pros/Cons
    • False Positive
  • Simulation
  • Memory performance attacks
  • RTL design
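
The Bloom filter entry in the list above can be made concrete with a minimal sketch; the SHA-256-based index derivation and the sizes are illustrative assumptions, not anything specified in lecture:

```python
# Minimal Bloom filter sketch: k hash functions set/check bits in a bit array.
# False positives are possible (a query may land on bits set by other items);
# false negatives are not. The hash construction here is an assumption.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive k indices from one digest (a simple illustrative construction).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present" (could be a false positive);
        # False means "definitely not present".
        return all(self.bits[pos] for pos in self._positions(item))
```

The pro (tiny memory footprint, no false negatives) and con (false positives, no deletion) tradeoffs fall directly out of this structure.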

Lecture 2 (1/15 Wed.)

  • Optimizing for energy/ Optimizing for the performance
    • Generally you should optimize for the users
  • state-of-the-art
  • RTL Simulation
    • Long, slow and can be costly
  • High level simulation
    • What should be employed?
  • Important to get the idea of how they are implemented in RTL
  • Allows designer to filter out techniques that do not work well
  • Design points
    • Design processors to meet the design points
  • Software stack
  • Design decisions
  • Datacenters
  • MIPS R2000
    • What are the architectural techniques that improve the performance of a processor over the MIPS R2000?
  • Moore's Law
  • in-order execution
  • out-of-order execution
  • technologies that are available on cellphones
  • new applications that are made available through new computer architecture techniques
    • more data mining (genomics/medical areas)
  • lower power (cellphones)
  • smaller cores (cellphones/computers)
  • etc.
  • Performance bottlenecks in a single thread/core processors
    • multi-core as an alternative
  • Memory wall (a part of scaling issue)
  • Scaling issue
    • Transistors are getting smaller
  • Reliability problems that cause errors
  • Analogies from Kuhn's “The Structure of Scientific Revolutions” (Recommended book)
    • Pre paradigm science
    • Normal science
    • Revolutionary science
  • Components of a computer
    • Computation
      • Communication
        • Storage
          • DRAM
          • NVRAM (Non-volatile memory): PCM, STT-MRAM
          • Storage (Flash/Harddrive)
  • Von Neumann Model (Control flow model)
    • Stored program computer
      • Properties of Von Neumann Model: Stored program, sequential instruction processing
      • Unified memory
        • When is an instruction interpreted as an instruction (as opposed to a datum)?
      • Program counter
      • Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  • Data flow model
    • Data flow machine
      • Data flow graph
    • Operands
    • Live-outs/Live-ins
      • Different types of data flow nodes (conditional/relational/barrier)
    • How to perform transactions in dataflow?
      • Example: bank transactions
  • Tradeoffs between control-driven and data-driven
    • Which is easier to program?
    • Which is easier to compile?
    • Which is more parallel (and does that mean it is faster?)
    • Which machines are more complex to design?
    • In control flow, when a program is stopped, there is a pointer to the current state (precise state).
  • ISA vs. Microarchitecture
    • Semantics in the ISA
    • uArch should obey the ISA
    • Changing ISA is costly, can affect compatibility.
  • Instruction pointers
  • uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
    • Conceptual techniques
      • Pipelining
      • Multiple instructions at a time
      • Out-of-order executions
      • etc.
    • Design techniques
      • Adder implementation (Bit serial, ripple carry, carry lookahead)
      • Connection Machine (an example of a machine that uses bit-serial logic to trade off latency for more parallelism)
  • Microprocessor: ISA + uArch + circuits
  • What is part of the ISA? Instructions, memory, etc.
    • Things that are visible to the programmer/software
  • What is not part of the ISA? (what goes inside: uArch techniques)
    • Things that are not supposed to be visible to the programmer/software but typically make the processor faster and/or consume less power and/or reduce complexity

Lecture 3 (1/17 Fri.)

  • Design tradeoffs
  • Macro Architectures
  • Reconfigurability vs. specialized designs
  • Parallelism (instructions, data parallel)
  • Uniform decode (Example: Alpha)
  • Steering bits (Sub-opcode)
  • 0,1,2,3 address machines
    • Stack machine
    • Accumulator machine
    • 2-operand machine
    • 3-operand machine
    • Tradeoffs between 0,1,2,3 address machines
  • Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  • Simple vs. complex data types (and their tradeoffs)
  • Semantic gap
  • Translation layer
  • Addressability
  • Byte/bit addressable machines
  • Virtual memory
  • Big/little endian
  • Benefits of having registers (data locality)
  • Programmer visible (Architectural) state
    • Programmers can access this directly
    • What are the benefits?
  • Microarchitectural state
    • Programmers cannot access this directly
  • Evolution of registers (from accumulators to registers)
  • Different types of instructions
    • Control instructions
    • Data instructions
    • Operation instructions
  • Addressing modes
    • Tradeoffs (complexity, flexibility, etc.)
  • Orthogonal ISA
    • Addressing modes that are orthogonal to instruction types
  • Vectored vs. non-vectored interrupts
  • Complex vs. simple instructions
    • Tradeoffs
  • RISC vs. CISC
    • Tradeoff
    • Backward compatibility
    • Performance
    • Optimization opportunity

Lecture 4 (1/22 Wed.)

  • Semantic gap
    • Small vs. Large semantic gap (CISC vs. RISC)
    • Benefit of RISC vs. CISC
  • Micro operations/microcode
    • Translate complex instructions into smaller instructions
  • Parallelism (motivation for RISC)
  • Compiler optimization
  • Code optimization through translation
  • VLIW
  • Fixed vs. variable length instructions
    • Tradeoffs
      • Alignment issues? (fetch/decode)
      • Decoding issues?
      • Code size?
      • Adding additional instructions?
      • Memory bandwidth and cache utilization?
      • Energy?
    • Encoding in variable length instructions
  • Structure of Alpha instructions and other uniform decode instructions
    • Different type of instructions
    • Benefit of knowing what type of instructions
      • Speculatively operate future instructions
  • x86 and other non-uniform decode instructions
    • Tradeoff vs. uniform decode
  • Tradeoffs for different number of registers
    • Spilling into memory if the number of registers is small
    • Compiler optimization on how to manage which value to keep/spill
  • Addressing modes
    • Benefits?
    • Types?
    • Different uses of addressing modes?
  • Various ISA-level tradeoffs
  • Virtual memory
  • Unalign memory access/aligned memory access
    • Cost vs. benefit of unaligned access
  • ISA specification
    • Things you have to obey/specify in the ISA specification
  • Architectural states
  • Microarchitecture implements how arch. state A is transformed to the next arch. state A'
  • Single cycle machines
    • Critical path in the single cycle machine
  • Multi cycle machines
  • Functional units
  • Performance metrics
    • CPI/IPC
      • CPI of a single cycle microarchitecture

Lecture 5 (1/24 Fri.)

  • Instruction processing
    • Fetch
    • Decode
    • Execute
    • Memory fetch
    • Writeback
  • Datapath & Control logic in microprocessors
  • Different types of instructions (I-type, R-type, etc.)
  • Control flow instructions
  • Non-control flow instructions
  • Delayed slot/Delayed branch
  • Single cycle control logic
  • Lockstep
  • Critical path analysis
    • Critical path of a single cycle processor
  • Combinational logic & Sequential logic
  • Control store
  • Tradeoffs of a single cycle uarch
  • Dynamic power/Static power
  • Speedup calculation
    • Parallelism
    • Serial bottleneck
    • Amdahl's bottleneck
  • Design principles
    • Common case design
    • Critical path design
    • Balanced designs
  • Multi cycle design
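
The speedup-calculation bullets above (parallelism, serial bottleneck, Amdahl's law) can be worked through with a small helper; the numbers are made up for illustration:

```python
def amdahl_speedup(parallel_fraction, num_units):
    """Amdahl's law: overall speedup when only a fraction of the work
    is sped up by a factor of num_units; the serial remainder
    bottlenecks the whole computation."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / num_units)

# 16 units on a 90%-parallel program: speedup ~ 6.4, far below 16.
print(amdahl_speedup(0.9, 16))

# Even with unboundedly many units, a 10% serial part caps speedup below 10x.
print(amdahl_speedup(0.9, 10**9))
```

This is the "serial bottleneck" in one line: the `serial` term in the denominator never shrinks, no matter how many units you add.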

Lecture 6 (1/27 Mon.)

  • Microcoded/Microprogrammed machines
    • States
    • Microinstructions
    • Microsequencing
    • Control store - Produces control signals
    • Microsequencer
    • Control signal
      • What do they have to control?
  • Instruction processing cycle
  • Latch signals
  • State machine
  • State variables
  • Condition code
  • Steering bits
  • Branch enable logic
  • Difference between gating and loading? (write enable vs. driving the bus)
  • Memory mapped I/O
  • Hardwired logic
    • What control signals come from hardwired logic?
  • Variable latency memory
  • Handling interrupts
  • Difference between interrupts and exceptions
  • Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  • Updating machine behavior
  • Horizontal microcode
  • Vertical microcode
  • Primitives

Lecture 7 (1/29 Wed.)

  • Pipelining
  • Limitations of the multi-programmed design
    • Idle resources
  • Throughput of a pipelined design
    • What dictates the throughput of a pipelined design?
  • Latency of the pipelined design
  • Dependency
  • Overhead of pipelining
    • Latch cost?
  • Data forwarding/bypassing
  • What is the ideal pipeline?
  • External fragmentation
  • Issues in pipeline designs
    • Stalling
      • Dependency (Hazard)
        • Flow dependence
        • Output dependence
        • Anti dependence
        • How to handle them?
    • Resource contention
    • Keeping the pipeline full
    • Handling exception/interrupts
    • Pipeline flush
    • Speculation
  • Interlocking
  • Multipath execution
  • Fine grain multithreading
  • No-op (Bubbles in the pipeline)
  • Valid bits in the instructions

Lecture 8 (1/31 Fri.)

  • Branch prediction
  • Different types of data dependence
  • Pipeline stalls
    • bubbles
    • How to handle stalls
    • Stall conditions
    • Stall signals
    • Dependences
      • Distance between dependences
    • Data forwarding/bypassing
    • Maintaining the correct dataflow
  • Different ways to design data forwarding path/logic
  • Different techniques to handle interlockings
    • SW based
    • HW based
  • Profiling
    • Static profiling
    • Helps from the software (compiler)
      • Superblock optimization
      • Analyzing basic blocks
  • How to deal with branches?
    • Branch prediction
    • Delayed branching (branch delay slot)
    • Forward control flow/backward control flow
    • Branch prediction accuracy
  • Profile guided code positioning
    • Based on the profile info, position the code accordingly
    • Try to make the next sequential instruction be the next inst. to be executed
  • Trace cache
  • Predicate combining (combine predicate for a branch instruction)
  • Predicated execution (control dependence becomes data dependence)
  • Definition of basic blocks
  • Control flow graph

Lecture 9 (2/3 Mon.)

  • Delayed branching
    • benefit?
    • What does it eliminate?
    • downside?
    • Delayed branching in SPARC (with squashing)
    • Backward compatibility with the delayed slot
    • What should be filled in the delayed slot
    • How to ensure correctness
  • Fine-grained multithreading
    • fetch from different threads
    • What are the issues (what if the program doesn't have many threads)
    • CDC 6600
    • Denelcor HEP
    • No dependency checking
    • Inst. from different thread can fill-in the bubbles
    • Cost?
  • Simultaneous multithreading
  • Branch prediction
    • Guess what to fetch next.
    • Misprediction penalty
    • Need to guess the direction and target
    • How to perform the performance analysis?
      • Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction.
      • Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.
        • How many extra instructions are being fetched?
        • What is the performance degradation?
    • How to reduce the miss penalty?
    • Predicting the next address (non PC+4 address)
    • Branch target buffer (BTB)
      • Predicting the address of the branch
    • Global branch history - for directions
    • Can use compiler to profile and get more info
      • Input set dictates the accuracy
      • Add time to compilation
    • Heuristics that are common and do not require profiling
      • Might be inaccurate
    • Programmer can tell the hardware (via pragmas (hints))
      • For example, x86 has the hint bit
    • Dynamic branch prediction
      • Last time predictor
      • Two-bit counter based prediction
        • One more bit for hysteresis
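
The performance-analysis questions above (cost of branch mispredictions given branch fraction, prediction accuracy, and penalty) reduce to simple arithmetic; the inputs below are made-up examples, not numbers from the lecture:

```python
def branch_cpi_overhead(branch_fraction, prediction_accuracy, penalty_cycles):
    """Extra cycles per instruction from branch mispredictions:
    fraction of instructions that are branches, times the miss rate,
    times the flush/refetch penalty paid per misprediction."""
    miss_rate = 1.0 - prediction_accuracy
    return branch_fraction * miss_rate * penalty_cycles

# Example: 20% branches, 95% accuracy, 20-cycle penalty, base CPI of 1.
base_cpi = 1.0
overhead = branch_cpi_overhead(0.20, 0.95, 20)
effective_cpi = base_cpi + overhead   # ~1.2, i.e. a ~20% slowdown
```

The same expression also answers "how many extra instructions are fetched": each misprediction wastes roughly `penalty_cycles` worth of fetch slots.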

Lecture 10 (2/5 Wed.)

  • Branch prediction accuracy
    • Why is accuracy so important?
      • Difference between 99% accuracy and 98% accuracy
      • Cost of a misprediction when the pipeline is very deep
  • Value prediction
  • Global branch correlation
    • Some branches are correlated
  • Local branch correlation
    • Some branches depend on their own past outcomes
  • Pattern history table
    • Record global taken/not taken results.
    • Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?)
  • One-level branch predictor
    • What information is used
  • Two-level branch prediction
    • What entries do you keep in the global history?
    • What entries do you keep in the local history?
    • How many tables?
    • Cost when training a table
    • What are the purposes of each table?
    • Potential problems of a two-level history
  • GShare predictor
    • Global history predictor is hashed with the PC
    • Combines the global history and the PC into one piece of information
    • How do you use the information? Why is the XOR result still usable?
  • Slides (page 16-18) for a good overview of one- and two-level predictors
  • Warmup cost of the branch predictor
    • Hybrid solution? Fast warmup is used first, then switch to the slower one.
  • Tournament predictor (Alpha 21264)
  • Other types of branch predictor
    • Using machine learning?
    • Geometric history length
      • Look at branches far behind (but using geometric step)
  • Predicated execution - eliminate branches
    • What are the tradeoffs
    • What if the block is big (can lead to executing a lot of useless work)
    • Allows easier code optimization
      • From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
      • Reduce control dependences
    • Need ISA support
  • Wish branches
    • Compiler generate both predicated and non-predicated codes
    • HW decides which one to use
      • Use branch prediction on an easy to predict code
      • Use predicated execution on a hard to predict code
      • Compiler can be more aggressive in optimizing the code
    • What are the tradeoffs (slide# 47)
  • Multi-path execution
    • Execute both paths
    • Can lead to wasted work
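
A minimal sketch of the gshare idea from this lecture: XOR the global history register with the branch PC to index a table of two-bit saturating counters. The table size and history length here are arbitrary choices for illustration:

```python
class GsharePredictor:
    """Global history XORed with the branch PC indexes a table of
    2-bit saturating counters (0-1 predict not-taken, 2-3 predict taken).
    The XOR folds history and PC into one index, so different branches
    with different histories tend to map to different counters."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                              # global history register
        self.counters = [1] * (1 << index_bits)   # init weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

The two-bit counters provide the hysteresis mentioned in the previous lecture: one atypical outcome does not immediately flip the prediction.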

Lecture 11 (2/12 Wed.)

  • Call and return prediction
    • Direct call is easy to predict
    • Return is harder to predict (indirect branches)
      • Nested calls make return easier to predict
        • Can use stack to predict the return
    • Indirect branch prediction
      • These branches have multiple targets
      • For switch-case, virtual function calls, jump tables, interface calls
      • BTB to predict the target address - low accuracy
      • History based: BTB + GHR
      • Virtual program counter prediction
  • Complications in superscalar processors
    • Fetch? What if multiple branches are fetched at the same time?
    • Logic required to ensure correctness?
  • Multi-cycle executions (Different functional units take different number of cycles)
    • Instructions can retire out-of-order
      • How to deal with this case? Stall? Throw exceptions if there are problems?
  • Exceptions and Interrupts
    • When are they handled?
      • Why should some interrupts be handled right away?
  • Precise exception
    • arch. state should be consistent before handling the exception/interrupts
      • Easier to debug (you see the sequential flow when the interrupt occurs)
        • Deterministic
      • Easier to recover from the exception
      • Easier to restart the processes
    • How to ensure precise exception?
      • Tradeoffs between each method
  • Reorder buffer
    • Reorder results before they are visible to the arch. state
      • Need to preserve the sequential semantics and data
    • What information goes in the ROB entry
    • Where to get the value from (forwarding path? reorder buffer?)
      • Extra logic to check where the youngest instruction/value is
      • Content addressable search
        • A lot of comparators
    • Different ways to simplify the reorder buffer
    • Register renaming
      • Same register refers to independent values (lack of registers)
    • Where does the exception happen (after retire)
  • History buffer
    • Update the register file when the instruction complete. Unroll if there is an exception.
  • Future file (commonly used, along with reorder buffer)
    • Keep two sets of register files
      • An updated value (speculative), called the future file
      • A backup value (to restore the state quickly)
    • Doubles the cost of the regfile, but reduces the area since you don't need a content addressable memory (compared to ROB alone)
  • Branch misprediction resembles Exception
    • The difference is that branch misprediction is not visible to the software
      • Also much more common (say, divide by zero vs. a mispredicted branch)
    • Recovery is similar to exception handling
  • Latency of the state recovery
  • What to do during the state recovery
  • Checkpointing
    • Advantages?

Lecture 14 (2/19 Wed.)

  • Predictor (branch predictor, cache line predictor …)
  • Power budget (and its importance)
  • Architectural state, precise state
  • Memory dependence is known dynamically
  • Register state is not shared across threads/processors
  • Memory state is shared across threads/processors
  • How to maintain speculative memory states
  • Write buffers (helps simplify the process of checking the reorder buffer)
  • Overall OoO mechanism
    • What are other ways of eliminating dispatch stalls
    • Dispatch when the sources are ready
    • Retired instructions make the source available
    • Register renaming
    • Reservation station
      • What goes into the reservation station
      • Tags required in the reservation station
    • Tomasulo's algorithm
    • Without precise exception, OoO is hard to debug
    • Arch. register ID
    • Examples in the slides
      • Slides 28 –> register renaming
      • Slides 30-35 –> Exercise (also on the board)
        • This will be useful for the midterm
    • Register alias table

Lecture 15 (2/21 Fri.)

  • OoO –> Restricted Dataflow
    • Extracting parallelism
    • What are the bottlenecks?
      • Issue width
      • Dispatch width
      • Parallelism in the program
    • More example on slide #10
    • What does it mean to be restricted data flow
      • Still visible as a Von Neumann model
    • Where does the efficiency come from?
    • Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  • Load/store handling
    • Would like to schedule them out of order, but make them visible in-order
    • When do you schedule the load/store instructions?
    • Can we predict if load/store are dependent?
    • This is one of the most complex parts of load/store handling
    • What information can be used to predict these load/store optimization?
  • Note: IPC = 1/CPI
  • Centralized vs. distributed? What are the tradeoffs?
  • How to handle when there is a misprediction/recovery
  • Token dataflow arch.
    • What are tokens?
    • How to match tokens
    • Tagged token dataflow arch.
    • What are the tradeoffs?
    • Difficulties?

Lecture 16 (2/24 Mon.)

  • SISD/SIMD/MISD/MIMD
  • Array processor
  • Vector processor
  • Data parallelism
    • Where does the concurrency arise?
  • Differences between array processor vs. vector processor
  • VLIW
  • Compactness of an array processor
  • Vector operates on a vector of data (rather than a single datum (scalar))
    • Vector length (also applies to array processor)
    • No dependency within a vector –> can have a deep pipeline
    • Highly parallel (both instruction level (ILP) and memory level (MLP))
    • But the program needs to be very parallel
    • Memory can be the bottleneck (due to very high MLP)
    • What do the functional units look like? Deep pipeline and simpler control.
    • CRAY-1 is an example of a vector processor
    • Memory access pattern in a vector processor
    • How to calculate the efficiency and performance of vector processors
    • What if there are multiple memory ports?
    • Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism)
      • Helps handling sparse matrices
    • Conditional operation
    • Structure of vector units
    • How to automatically parallelize code through the compiler?
      • This is a hard problem. Compiler does not know the memory address.
  • What do we need to ensure for both vector and array processor?
  • Sequential bottleneck
    • Amdahl's law
  • Intel MMX –> An example of Intel's approach to SIMD
    • No VLEN, use OpCode to define the length
    • Stride is one in MMX
    • Intel SSE –> Modern version of MMX
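
Gather/scatter, mentioned above as what makes vector processors more programmable for sparse data, can be illustrated in plain Python (the data and indices are made up):

```python
def gather(memory, indices):
    """Gather: load non-contiguous elements into a dense vector, so the
    vector unit can then operate on unit-stride data."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Scatter: write a dense vector back to non-contiguous locations."""
    for i, v in zip(indices, values):
        memory[i] = v

# Sparse-style access: pick out scattered elements, double them, write back.
mem = list(range(100, 110))
idx = [0, 3, 7]
vec = gather(mem, idx)                 # dense vector [100, 103, 107]
scatter(mem, idx, [v * 2 for v in vec])
```

The middle step (`[v * 2 for v in vec]`) is where a real vector unit gets its win: once gathered, the operation is a plain unit-stride vector op.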

Lecture 17 (2/26 Wed.)

  • GPU
    • Warp/Wavefront
      • A bunch of threads sharing the same PC
    • SIMT
    • Lanes
    • FGMT + massively parallel
      • Tolerate long latency
    • Warp based SIMD vs. traditional SIMD
  • SPMD (Programming model)
    • Single program operates on multiple data
      • can have synchronization point
    • Many scientific applications are programmed in this manner
  • Control flow problem (branch divergence)
    • Masking (in a branch, mask threads that should not execute that path)
    • Lower SIMD efficiency
    • What if you have layers of branches?
  • Dynamic warp formation
    • Combining threads from different warps to increase SIMD utilization
    • This can cause memory divergence
  • VLIW
    • Wide fetch
    • IA-64
    • Tradeoffs
      • Simple hardware (no dynamic scheduling, no dependency checking within VLIW)
      • Puts a lot of load on the compiler
  • Decoupled access/execute
    • Limited form of OoO
    • Tradeoffs
    • How to steer the instructions (determine dependency/stalling)?
    • Instruction scheduling techniques (static vs. dynamic)
  • Systolic arrays
    • Processing elements transform data in chains
    • Developed for image processing (for example, convolution)
  • Stage processing
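
The masking approach to branch divergence described above can be sketched as follows: the warp steps through both paths of a branch, and a per-lane mask selects which lanes commit results on each path. This is an illustrative model, not any specific GPU's implementation:

```python
def simt_branch(cond_mask, lanes, then_fn, else_fn):
    """Execute a divergent branch on a warp: the warp serializes both
    paths, and each lane commits only the path its mask selects.
    SIMD efficiency drops because masked-off lanes do no useful work."""
    results = list(lanes)
    # Taken path: only lanes whose condition is True commit.
    for i, active in enumerate(cond_mask):
        if active:
            results[i] = then_fn(lanes[i])
    # Not-taken path: the complementary mask commits.
    for i, active in enumerate(cond_mask):
        if not active:
            results[i] = else_fn(lanes[i])
    return results

# Warp of 4 lanes running: if x % 2 == 0: x //= 2 else: x = 3*x + 1
lanes = [4, 7, 10, 5]
mask = [x % 2 == 0 for x in lanes]
out = simt_branch(mask, lanes, lambda x: x // 2, lambda x: 3 * x + 1)
# out == [2, 22, 5, 16]
```

With nested branches, masks combine, so utilization can keep halving; dynamic warp formation (above) tries to recover it by regrouping lanes with the same outcome.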

Lecture 18 (2/28 Fri.)

  • Tradeoffs of VLIW
    • Why does VLIW require static instruction scheduling?
    • Whose job is it?
      • Compiler can rearrange basic blocks/instruction
  • Basic block
    • Benefits of having large basic block
  • Entry/Exit
    • Handling entries/exits
  • Trace cache
    • How to ensure correctness?
    • Profiling
    • Fixing up the instruction order to ensure correctness
    • Dealing with multiple entries into the block
    • Dealing with multiple exits out of the block
  • Super block
    • How to form super blocks?
    • Benefit of super block
    • Tradeoff between not forming a super block and forming a super block
      • Ambiguous branch (after profiling, both taken/not taken are equally likely)
      • Cleaning up
  • What scenario would make trace cache/superblock/profiling less effective?
  • List scheduling
    • Help figuring out which instructions VLIW should fetch
    • Try to maximize instruction throughput
    • How to assign priorities
    • What if some instructions take longer than others
  • Block structured ISA (BS-ISA)
    • Problems with trace scheduling?
    • What type of program will benefit from BS-ISA
    • How to form blocks in BS-ISA?
      • Combining basic blocks
      • multiple merged basic blocks
    • How to deal with entries/exits in BS-ISA?
      • undo the executed instructions from the entry point, then fetch the new block
    • Advantages over trace cache
  • Benefit of VLIW + Static instruction scheduling
  • Intel IA-64
    • Static instruction scheduling and VLIW

Lecture 19 (3/19 Wed.)

  • Ideal cache
    • More capacity
    • Fast
    • Cheap
    • High bandwidth
  • DRAM cell
    • Cheap
    • Senses the perturbation through the sense amplifier
    • Slow and leaky
  • SRAM cell (Cross coupled inverter)
    • Expensive
    • Fast (easier to sense the value in the cell)
  • Memory bank
    • Read access sequence
      • DRAM: Activate → Read → Precharge (if needed)
    • What dominates the access latency for DRAM and SRAM
  • Scaling issue
    • Hard to scale the cell to be smaller
  • Memory hierarchy
    • Prefetching
    • Caching
  • Spatial and temporal locality
    • Cache can exploit these
    • Recently used data is likely to be accessed
    • Nearby data is likely to be accessed
  • Caching in a pipeline design
  • Cache management
    • Manual
      • Data movement is managed manually
        • Embedded processor
        • GPU scratchpad
    • Automatic
      • HW manage data movements
  • Latency analysis
    • Based on the hit and miss status, next level access time (if miss), and the current level access time
  • Cache basics
    • Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc.
  • Cache access
    • How to access tag and data (in parallel vs serially)
    • How do tag and index get used?
    • Modern processors perform serial access for higher level cache (L3 for example) to save power
  • Cost and benefit of having more associativity
    • Given the associativity, which block should be replaced if the set is full
    • Replacement policy
      • Random
      • Least recently used (LRU)
      • Least frequently used
      • Least costly to refetch
      • etc.
  • How to implement LRU
    • How to keep track of access ordering
      • Complexity increases rapidly
    • Approximate LRU
      • Victim and next Victim policy
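
The "how to implement LRU" bullets above can be modeled in software with an ordered map acting as the recency list; real hardware tracks access ordering with per-set state (whose complexity grows quickly with associativity) and usually only approximates LRU, as noted above:

```python
from collections import OrderedDict

class LRUCache:
    """One set of cached blocks with least-recently-used replacement.
    The OrderedDict stands in for hardware recency tracking:
    move-to-end on every access, evict from the front when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, tag):
        """Returns True on hit, False on miss (filling the block on miss)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = True
        return False
```

For example, a 2-way set on the access stream A, B, A, C, B gives hit/miss results miss, miss, hit, miss, miss (C evicts B, then B evicts A).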

Lecture 20 (3/21 Fri.)

  • Set thrashing
    • Working set is bigger than the associativity
  • Belady's OPT
    • Is this optimal?
    • Complexity?
  • Similarity between cache and page table
    • Number of blocks vs pages
    • Time to find the block/page to replace
  • Handling writes
    • Write through
      • Simpler, no consistency issues between levels
    • Write back
      • Need a modified (dirty) bit to make sure accesses to data get the updated data
  • Sectored cache
    • Use subblock
      • lower bandwidth
      • more complex
  • Instruction vs data cache
    • Where to place instructions
      • Unified vs. separated
    • In the first level cache
  • Cache access
    • First level access
    • Second level access
      • When to start the second level access
        • Performance vs. energy
  • Address translation
  • Homonym and Synonyms
    • Homonym: Same VA but maps to different PA
      • With multiple processes
    • Synonyms: Multiple VAs map to the same PA
      • Shared libraries, shared data, copy-on-write
      • I/O
    • Can these create problems when we have the cache
    • How to eliminate these problems?
      • Page coloring
  • Interaction between cache and TLB
    • Virtually indexed vs. physically indexed
    • Virtually tagged vs. physically tagged
    • Virtually indexed physically tagged
  • Virtual memory in DRAM
    • Control where data is mapped to in channel/rank/bank
      • More parallelism
      • Reduce interference
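
Belady's OPT from the start of this lecture can be sketched offline; it needs the entire future access trace, which is why it serves as an upper bound for comparison rather than an implementable policy:

```python
def belady_misses(trace, capacity):
    """Belady's OPT: on a miss with a full cache, evict the block whose
    next use is furthest in the future (or that is never used again).
    Requires the whole future trace, so it is an offline bound only."""
    cache, misses = set(), 0
    for t, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            # Evict the cached block reused furthest in the future.
            def next_use(b):
                for i in range(t + 1, len(trace)):
                    if trace[i] == b:
                        return i
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```

On the trace A, B, C, A, B with 2 blocks, OPT takes 4 misses (it evicts B when C arrives, keeping A for its nearer reuse), whereas LRU takes 5.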

Lecture 21 (3/24 Mon.)

  • Different parameters that affect cache miss
  • Thrashing
  • Different types of cache misses
    • Compulsory misses
      • Can mitigate with prefetching
    • Capacity misses
      • Larger cache
    • Conflict misses
      • More associativity
      • Victim cache
      • Hashing
  • Large block vs. small block
  • Subblocks
  • Victim cache
    • Small, but fully assoc. cache behind the actual cache
    • Holds recently evicted cache blocks
    • Prevent ping-ponging
  • Pseudo associativity
    • Simpler way to implement associative cache
  • Skewed assoc. cache
    • Different hashing functions for each way
  • Restructure data access pattern
    • Order of loop traversal
    • Blocking
  • Memory level parallelism
    • Cost per miss of a parallel cache miss is less costly compared to serial misses
  • MSHR
    • Keep track of pending cache misses
      • Think of this as the load/store buffer-ish for cache
    • What information goes into the MSHR?
    • When do you access the MSHR?

Lecture 22 (3/26 Wed.)

  • Multi-porting
    • Virtual multi-porting
      • Time-share the port, not too scalable but cheap
    • True multiporting
  • Multiple cache copies
  • Banking
    • Can have bank conflict
    • Extra interconnects across banks
    • Address mapping can mitigate bank conflict
    • Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  • Accessing DRAM
    • Row bits
    • Column bits
    • Addressability
    • DRAM has its own clock
  • DRAM (1T) vs. SRAM (6T)
    • Cost
    • Latency
  • Interleaving in DRAM
    • Effects from address mapping on memory interleaving
    • Effects from memory access patterns from the program on interleaving
  • DRAM Bank
    • To minimize the cost of interleaving (banks share the data bus and the command bus)
  • DRAM Rank
    • Minimize the cost of the chip (a bundle of chips operated together)
  • DRAM Channel
    • An interface to DRAM, each with its own ranks/banks
  • DIMM
    • More DIMMs add interconnect complexity
  • List of commands to read/write data into DRAM
    • Activate → read/write → precharge
    • Activate moves data into the row buffer
    • Precharge prepares the bank for the next access
  • Row buffer hit
  • Row buffer conflict
  • Scheduling memory requests to lower row conflicts
  • Burst mode of DRAM
    • Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits
  • Address mapping
    • Row interleaved
    • Cache block interleaved
  • Memory controller
    • Sending DRAM commands
    • Periodically send commands to refresh DRAM cells
    • Ensure correctness and data integrity
    • Where to place the memory controller
      • On CPU chip vs. at the main memory
        • Higher BW on-chip
    • Determine the order of requests that will be serviced in DRAM
      • Request queues that hold requests
      • Send requests whenever the request can be sent to the bank
      • Determine which command (across banks) should be sent to DRAM
  • Priority of demand vs. prefetch requests
  • Memory scheduling policies
    • FCFS
    • FR-FCFS
      • Capped FR-FCFS: FR-FCFS with a timeout
      • Usually this is done in a command level (read/write commands and precharge/activate commands)
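
The FR-FCFS policy above can be sketched as a selection function over the request queue; the dictionary-based request representation is an assumption for illustration (as noted above, real schedulers work at the DRAM command level):

```python
def fr_fcfs_pick(queue, open_row):
    """FR-FCFS: among queued requests (kept in arrival order), first
    prefer the oldest request that hits the currently open row
    (a row-buffer hit has lower latency); only if there is no row hit,
    fall back to plain FCFS and pick the oldest request overall."""
    for req in queue:                 # queue is in arrival (age) order
        if req["row"] == open_row:
            return req                # First-Ready: oldest row-buffer hit
    return queue[0]                   # no hit anywhere: oldest request

requests = [{"id": 0, "row": 7}, {"id": 1, "row": 3}, {"id": 2, "row": 3}]
assert fr_fcfs_pick(requests, open_row=3)["id"] == 1   # oldest row hit wins
assert fr_fcfs_pick(requests, open_row=9)["id"] == 0   # no hit -> oldest
```

The unfairness noted in later lectures is visible here: a thread with high row-buffer locality keeps winning the first loop, starving others; the "capped" variant bounds how long that can continue.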

Lecture 23 (3/28 Fri.)

  • DRAM design choices
    • Cost/density/latency/BW/Yield
  • Sense Amplifier
    • How do they work
  • Dual data rate
  • Subarray
  • Rowclone
    • Moving bulk data from one row to another
    • Lower latency and bandwidth usage when copying or zeroing out data
  • TL-DRAM
    • Far segment
    • Near segment
    • What causes the long latency
    • Benefit of TL-DRAM
      • TL-DRAM vs. DRAM cache (adding a small cache in DRAM)

Lecture 24 (3/31 Mon.)

  • Memory controller
    • Different commands
  • Memory scheduler
    • Determine the order of requests to be issued to DRAM
    • Age/hit-miss status/types(load/store/prefetch/from GPU/from CPU)/criticality
  • Row buffer
    • hit/conflict
    • open/closed row
    • Open row policy
    • Closed row policy
    • Tradeoffs between open and closed row policy
      • If a program has high row buffer locality, an open row policy might benefit more
      • A closed row policy services miss requests faster
  • Bank conflict
  • Interference from different applications/threads
    • Different programs/processes/threads interfere with each other
      • introduce more row buffer/bank conflicts
    • The memory scheduler has to manage this interference
    • Memory hog problems
  • Interference in the data/command bus
  • FR-FCFS
    • Why does FR-FCFS make sense?
      • Row buffer hits have lower latency
    • Issues with FR-FCFS
      • Unfairness
  • STFM
    • Fairness issue in memory scheduling
    • How does STFM calculate fairness and slowdown
      • How to estimate a thread's execution time when it is running alone
    • Definition of fairness (based on STFM, different papers/areas define fairness differently)
  • PAR-BS
    • Parallelism in programs
    • Interference across banks
    • How to form a batch
    • How to determine ranking between batches/within a batch
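The STFM-style slowdown and unfairness computation can be illustrated with a small sketch. The execution times below are made-up numbers; in real STFM the alone-run time is estimated in hardware while the threads run together, which is the hard part.

```python
# Sketch of STFM-style slowdown/unfairness math (illustrative numbers).
# slowdown_i = T_shared_i / T_alone_i; unfairness = max/min slowdown.

def slowdowns(t_shared, t_alone):
    """Per-thread slowdown: time when sharing / estimated time alone."""
    return [s / a for s, a in zip(t_shared, t_alone)]

def unfairness(t_shared, t_alone):
    """STFM-style unfairness: ratio of worst to best slowdown."""
    s = slowdowns(t_shared, t_alone)
    return max(s) / min(s)

# Thread 0 is slowed down 3x, thread 1 only 1.2x -> unfairness 2.5
print(unfairness([300, 120], [100, 100]))  # 2.5
```

When unfairness exceeds a threshold, STFM prioritizes the most-slowed-down thread.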

Lecture 25 (4/2 Wed.)

  • Latency sensitivity
    • Performance drops a lot when the memory request latency is long
  • TCM
    • Tradeoff between throughput and fairness
    • Latency sensitive cluster (non-intensive cluster)
      • Ranking based on memory intensity
    • Bandwidth intensive cluster
      • Round robin within the cluster
    • Generally, the latency sensitive cluster has higher priority
    • Provide robust fairness vs. throughput
    • Complexity of TCM?
  • Different ways to control interference in DRAM
    • Partitioning of resource
      • Channel partitioning: map applications that interfere with each other to different channels
        • Keep track of application's characteristics
        • Dedicating a channel might waste bandwidth
        • Need OS support to determine the channel bits
    • Source throttling
      • A controller throttles the cores depending on the performance target
      • Example: Fairness via source throttling
        • Detect unfairness and throttle the application that is interfering
        • How do you estimate slowdown?
        • Threshold based solution: hard to configure
    • App/thread scheduling
      • Critical threads usually stall progress
    • Designing DRAM controller
      • Has to handle the normal DRAM operations
        • Read/write/refresh/all the timing constraints
      • Keep track of resources
      • Assign priorities to different requests
      • Manage requests to banks
    • Self-optimizing controller
      • Use machine learning to improve DRAM controller
  • DRAM Refresh
    • Why DRAM has to be refreshed every 64ms
    • Banks are unavailable during refresh
      • LPDDR mitigates this by using per-bank refresh
    • Refresh takes longer with bigger DRAM
    • Distributed refresh: stagger refresh every 64 ms in a distributed manner
      • As opposed to burst refresh (long pause time)
  • RAIDR: Reduce DRAM refresh by profiling and binning
    • Some rows do not have to be refreshed very frequently
      • Profile the rows
        • High temperature changes the retention time: need online profiling
  • Bloom filter
    • Represent set membership
    • Approximated
    • Can contain false positive
      • Better/more hash functions help reduce false positives
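A Bloom filter like the one RAIDR uses for retention-time bins can be sketched in a few lines: set k bits on insert, test all k bits on lookup. The sizes and the double-hashing scheme below are arbitrary illustrative choices.

```python
# Toy Bloom filter: approximate set membership with possible false
# positives but no false negatives. m and k are arbitrary here.
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = 0                  # bit vector stored as an int

    def _hashes(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        # Derive k indices from two hashes (standard double hashing)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._hashes(item):
            self.bits |= 1 << idx

    def maybe_contains(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits >> idx & 1 for idx in self._hashes(item))

bf = BloomFilter()
bf.add("row42")
print(bf.maybe_contains("row42"))   # True (no false negatives)
print(bf.maybe_contains("row99"))   # usually False, but may be True
```

This is why a Bloom filter suits RAIDR: a false positive only means a row gets refreshed more often than needed, which is safe.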

Lecture 26 (4/7 Mon.)

  • Tolerating latency can be costly
    • Instruction window is complex
      • Benefit also diminishes
    • Designing the buffers can be complex
    • A simpler way than out-of-order execution to tolerate latency is desirable
  • Different sources that cause the core to stall in OoO
    • Cache miss
    • Note that stall happens if the inst. window is full
  • Scaling instruction window size is hard
    • It is better (less complex) to make the window more efficient
  • Runahead execution
    • Try to obtain MLP w/o increasing the instruction window
    • Runahead (i.e. execute ahead) when there is a long-latency memory instruction
      • A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      • Execute future instructions to generate accurate prefetches
      • Allow future data to be in the cache
    • How to support runahead execution?
      • Need a way to checkpoint the state when entering runahead mode
      • How to make executing in the wrong path useful?
      • Need runahead cache to handle load/store in Runahead mode (since they are speculative)
    • Cost and benefit of runahead execution (slide number 27)
    • Runahead can have inefficiency
      • Runahead periods that are useless
        • Get rid of these useless, inefficient periods
    • What if there is a dependent cache miss
      • Cannot be parallelized in vanilla runahead
      • Can predict the value of the dependent load
        • How to predict the address of the load
          • Delta value information
          • Stride predictor
          • AVD prediction
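The address predictors listed above (e.g. a stride predictor) can be sketched as a small PC-indexed table: remember the last address and stride per load, and predict once the same stride is seen again. The table organization and training rule here are illustrative assumptions.

```python
# Minimal PC-indexed stride predictor sketch for load addresses.
# Predicts next_addr = addr + stride once a stride repeats.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride)

    def predict_and_update(self, pc, addr):
        """Return the predicted next address for this PC, then train."""
        pred = None
        if pc in self.table:
            last, stride = self.table[pc]
            new_stride = addr - last
            if new_stride == stride:        # same stride seen again
                pred = addr + stride        # confident: predict ahead
            self.table[pc] = (addr, new_stride)
        else:
            self.table[pc] = (addr, 0)      # first sighting: just record
        return pred

sp = StridePredictor()
for a in (100, 108, 116):
    p = sp.predict_and_update(pc=0x400, addr=a)
print(p)  # 124: the +8 stride was observed twice, so predict 116 + 8
```

A runahead processor could use such a prediction to break a dependent-miss chain instead of waiting for the parent load's value.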

Lecture 27 (4/9 Wed.)

  • Questions regarding prefetching
    • What to prefetch
    • When to prefetch
    • How to prefetch
    • Where to prefetch from
  • Prefetching can cause thrashing (evict a useful block)
  • Prefetching can also be useless (not being used)
    • Need to be efficient
  • Can cause memory bandwidth problem in GPU
  • Prefetch the whole block, more than one block, or subblock?
    • Each one of them has pros and cons
    • Big prefetches are more likely to waste bandwidth
    • Commonly done in a cache block granularity
  • Prefetch accuracy: fraction of useful prefetches out of all the prefetches
  • Prefetchers usually predict based on
    • Past knowledge
    • Compiler hints
  • Prefetcher has to prefetch at the right time
    • Prefetch that is too early might get evicted
      • It might also evict other useful data
    • Prefetch too late does not hide the whole memory latency
  • Previous prefetches at the same PC can be used as the history
  • Previous demand requests are also good information to use for prefetching
  • Prefetch buffer
    • Place the prefetch data to avoid thrashing
      • Can treat demand/prefetch requests separately
      • More complex
  • Generally, demand blocks are more important
    • This means eviction should prefer prefetch blocks as opposed to demand blocks
  • Tradeoffs between where do we place the prefetcher
    • Look at L1 hits and misses
    • Look at L1 misses only
    • Look at L2 misses
    • Different access patterns affect accuracy
      • Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss)
  • Software vs. hardware vs. execution based prefetching
    • Software: the ISA provides prefetch instructions, and software utilizes them
      • What information is useful
      • How to make sure the prefetch is timely
      • What if you have a pointer based structure
        • Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one timely enough)
          • Can be solved by hinting at the next-next and/or next-next-next address
    • Hardware: Identify the pattern and prefetch
    • Execution driven: Opportunistically try to prefetch (runahead, dual-core execution)
  • Stride prefetcher
    • Predict strides, which is common in many programs
    • Cache block based or instruction based
  • Stream buffer design
    • Buffer the stream of accesses (next address)
    • Use the information to prefetch
  • What affects prefetcher performance
    • Prefetch distance
      • How far ahead should we prefetch
    • Prefetch degree
      • How many prefetches do we prefetch
  • Prefetcher performance
    • Coverage
      • Out of all demand misses, how many are covered by prefetches
    • Accuracy
      • Out of all the prefetch requests, how many are actually getting used
    • Timeliness
      • How much memory latency can we hide from the prefetch requests
  • Feedback directed prefetcher
    • Use the result of the prefetcher as a feedback to the prefetcher
      • with accuracy, timeliness, polluting information
  • Markov prefetcher
    • Prefetch based on the previous history
    • Use markov model to predict
    • Pros: Can cover arbitrary patterns (easy for linked-list traversals or trees)
    • Downside: High cost, cannot help with compulsory misses (no history)
  • Content directed prefetching
    • Identify the content in memory that holds pointers (which are used as the addresses to prefetch)
    • Not very efficient (hard to figure out which block is the pointer)
      • Software can give hints
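The Markov prefetcher from the list above can be sketched as a table of observed miss successors: remember which miss address tended to follow each miss address, and prefetch the most frequent successor. The table format and single-successor choice are illustrative simplifications.

```python
# Toy Markov prefetcher: learn "miss A is usually followed by miss B"
# and use that as a prefetch hint. Unbounded table (real ones are not).
from collections import defaultdict, Counter

class MarkovPrefetcher:
    def __init__(self):
        self.successors = defaultdict(Counter)  # addr -> Counter of next addrs
        self.last_miss = None

    def on_miss(self, addr):
        """Record the miss and return a prefetch candidate (or None)."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1   # train on history
        self.last_miss = addr
        if self.successors[addr]:
            # Prefetch the most common successor seen after this address
            return self.successors[addr].most_common(1)[0][0]
        return None

mp = MarkovPrefetcher()
for a in (0xA, 0xB, 0xC, 0xA):   # history: A->B, B->C, C->A
    hint = mp.on_miss(a)
print(hex(hint))  # 0xb: after A we previously saw B
```

The sketch also shows both downsides from the notes: the table costs storage, and a compulsory miss (no history for that address) yields no hint.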

Lecture 28 (4/14 Mon.)

  • Execution based prefetcher
    • Helper thread/speculative thread
      • Use another thread to pre-execute a program
    • Can be software based or hardware based
    • Discover misses before the main program (to prefetch data in a timely manner)
    • How do you construct the helper thread
    • Preexecute instruction (one example of how to initialize a speculative thread), slide 9
  • Benefit of multiprocessor
    • Improve performance without significantly increasing power consumption
    • Better cost efficiency and easier to scale
    • Improve dependability (in case another core is faulty)
  • Different types of parallelism
    • Instruction level parallelism
    • Data level parallelism
    • Task level parallelism
  • Task level parallelism
    • Partition a single, potentially big, task into multiple parallel sub-tasks
      • Can be done explicitly (parallel programming by the programmer)
      • Or implicitly (hardware partitions a single thread speculatively)
    • Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement)
  • Loosely coupled multiprocessor
    • No shared global address space
      • Message passing to communicate between different sources
    • Simple to manage memory
  • Tightly coupled multiprocessor
    • Shared global address space
    • Need to ensure consistency of data
  • Switch on event multithreading
    • Switch to another context when there is an event (for example, a cache miss)
  • Simultaneous multithreading
    • Dispatch instructions from multiple threads at the same time
  • Amdahl's law
    • Maximum speedup
    • Parallel portion is not perfect
      • Serial bottleneck
      • Synchronization cost
      • Load balance
        • Some threads have more work and require more time to reach the sync. point
  • Issues in parallel programming
    • Correctness
    • Synchronization
    • Consistency

Lecture 29 (4/16 Wed.)

  • Ordering of instructions
    • Maintaining memory consistency when there are multiple threads and shared memory
    • Need to ensure the semantic is not changed
    • Making sure the shared data is properly locked when used
      • Support mutual exclusion
    • Ordering depends on the timing of each processor's execution
    • Debugging is also difficult (non-deterministic behavior)
  • Weak consistency: global ordering enforced only at synchronization points
    • The programmer hints where the synchronizations are
  • Total store order model: global ordering enforced only among stores
  • Cache coherence
    • Can be done in the software level or hardware level
  • Coherence protocol
    • Need to ensure that all the processors see and update the correct state of the cache block
    • Need to make sure that writes get propagated and serialized
    • Simple protocols are not scalable (single point of synchronization)
  • Update vs. invalidate
    • For invalidate, only the core that writes retains the valid copy
      • Can lead to ping-ponging (tons of read/writes from several processors)
    • For updates, bus becomes the bottleneck
  • Snoopy bus
    • Bus based, single point of serialization
    • More efficient with small number of processors
    • All caches snoop other caches' read/write requests to keep cache blocks coherent
  • Directory based
    • Single point of serialization per block
    • The directory coordinates coherence
    • More scalable
    • The directory keeps track of where the copies of each block resides
      • Supply data on a read
      • Invalidate the block on a write
      • Has an exclusive state
  • MSI coherence protocol
    • Slide number 56-57
    • Consumes bus bandwidth (needs an “exclusive” state)
  • MESI coherence protocol
    • Adds an exclusive state (this is the only cached copy, and it is clean) to MSI
  • Tradeoffs between snooping and directory based
    • Slide 71 has a good summary on this
  • MOESI
    • Improvement over MESI protocol
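The MSI protocol above can be summarized as a transition table for one cache block, driven by the local processor's reads/writes and snooped bus events. This is a heavily simplified sketch (no data transfers, bus races, or transient states).

```python
# Simplified MSI state machine for one cache block.
# (state, event) -> next state; bus_* events are snooped from others.
MSI = {
    ("I", "proc_read"):  "S",   # read miss: fetch, become Shared
    ("I", "proc_write"): "M",   # write miss: fetch exclusive, Modified
    ("S", "proc_read"):  "S",
    ("S", "proc_write"): "M",   # upgrade: invalidate other copies
    ("M", "proc_read"):  "M",
    ("M", "proc_write"): "M",
    ("S", "bus_read"):   "S",
    ("S", "bus_write"):  "I",   # another core writes: invalidate
    ("M", "bus_read"):   "S",   # supply data, downgrade to Shared
    ("M", "bus_write"):  "I",   # supply data, invalidate
    ("I", "bus_read"):   "I",
    ("I", "bus_write"):  "I",
}

state = "I"
for ev in ("proc_read", "proc_write", "bus_read"):
    state = MSI[(state, ev)]
print(state)  # S: read miss -> S, write upgrade -> M, snooped read -> S
```

MESI's addition is visible in this table's cost: in MSI, even a private read-then-write pair pays a bus transaction for the S-to-M upgrade.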

Lecture 30 (4/18 Fri.)

  • Interference
  • Complexity of the memory scheduler
    • Ranking/prioritization has cost
    • Complex scheduler has higher latency
  • Performance metric for multicore/multithead applications
    • Speedup
    • Slowdown
    • Harmonic vs. weighted
  • Fairness metric
    • Maximum slowdown
      • Why does it make sense
      • Any scenario that it does not make sense?
  • Predictable performance
    • Why is it important?
      • In server environment, different jobs are on the same server
      • In a mobile environment, there are multiple sources that can slow down other sources
    • How to relate slowdown with request service rate
    • MISE: soft slowdown guarantee
  • BDI
    • Memory wall
      • What is the concern regarding the memory wall
    • Size of the cache on the die (CPU die)
    • One possible solution: cache compression
      • What are the problems of existing cache compression mechanisms
        • Some are too complex
        • Decompression is in the critical path
          • Need to decompress when reading the data → decompression should not be in the critical path
          • Important factor to the performance
    • Software compression is not good enough to compress everything
    • Zero value compression
      • Simple
      • Good compression ratio
      • What if the data does not have many zeroes
    • Frequent value compression
      • Some data values appear frequently
      • Simple and good compression ratio
      • Have to profile
      • Decompression is complex
    • Frequent pattern compression
      • Still too complex in terms of decompression
    • Base-delta compression
      • Easy to decompress but retains the benefits of compression
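Base-delta compression can be sketched as storing one base value plus narrow deltas when every delta fits in a few bytes; decompression is then a single addition per value. The field widths below are illustrative, not the exact BDI encoding.

```python
# Sketch of base+delta cache-line compression in the spirit of BDI.
# A line of wide values compresses to one base plus small deltas.

def bd_compress(line, delta_bytes=1):
    """Return (base, deltas) if the line compresses, else None."""
    base = line[0]
    deltas = [v - base for v in line]
    limit = 1 << (8 * delta_bytes - 1)          # signed delta range
    if all(-limit <= d < limit for d in deltas):
        return base, deltas                     # e.g. 8B base + 1B/value
    return None                                 # line stays uncompressed

def bd_decompress(base, deltas):
    # Decompression is just parallel additions: cheap and off the
    # critical path concern mentioned above.
    return [base + d for d in deltas]

line = [0x1000, 0x1008, 0x1010, 0x1018]    # e.g. nearby pointers
c = bd_compress(line)
print(c is not None and bd_decompress(*c) == line)  # True
```

The example shows why BDI works well on pointer-heavy data: values have low dynamic range relative to one base, even when each value is wide.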

Lecture 31 (4/28 Mon.)

  • Directory based cache coherence
    • Each directory has to handle validation/invalidation
    • Extra cost of synchronization
    • Need to ensure race conditions are resolved
  • Interconnection
    • Topology
      • Bus
      • Mesh
        • Torus
      • Tree
      • Butterfly
      • Ring
        • Bi-directional ring
          • More scalable
        • Hierarchical ring
          • Even more scalable
          • More complex
      • Crossbar
      • etc.
    • Circuit switching
    • Multistage network
      • Butterfly
      • Delta network
    • Handling contention
      • Buffering vs. dropping/deflection (no buffering)
    • Routing algorithm
      • Handling deadlock
      • X-Y routing
        • Turn model (to avoid deadlocks)
      • Add more buffering for an escape path
      • Oblivious routing
        • Can take different path
          • DOR between each intermediate location
        • Balance network load
      • Adaptive routing
        • Use the state of the network to determine the route
          • Aware of local and/or global congestions
        • Non-minimal adaptive routing can have livelocks
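Dimension-ordered (X-Y) routing from the list above can be sketched in a few lines: route fully in X first, then in Y. Forbidding Y-to-X turns is what rules out the cyclic channel dependences that cause deadlock; the mesh coordinates here are illustrative.

```python
# Sketch of X-Y (dimension-ordered) routing in a 2D mesh.

def xy_route(src, dst):
    """Return the list of hops from src to dst as (x, y) coordinates."""
    x, y = src
    path = [src]
    while x != dst[0]:                 # X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then Y dimension, never back to X
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

This is also why X-Y routing is oblivious in the strict sense: the path depends only on source and destination, never on network load.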

Lecture 32 (4/30 Wed.)

  • Serialized code section
    • Degrade performance
    • Waste energy
  • Heterogeneous cores
    • Can execute serialized portion on a powerful large core
  • Tradeoff between multiple small cores, multiple large cores, or heterogeneous cores
  • Critical section
    • bottleneck in several multithreaded workloads
    • Asymmetry can help
    • Accelerated critical section
      • Use a large core to run serialized portion of the code
      • How to correctly support ACS
      • False serialization
      • Handling private/shared data
    • BIS
      • Identify the bottleneck
        • Serial bottleneck
        • Barrier
        • Critical section
        • Pipeline stages
      • Application might wait on different types of bottlenecks
      • Allows BottleneckCall and BottleneckReturn instructions
      • Acceleration can be done in multiple ways
        • ship to a big core
        • increase the frequency
        • Prioritize the thread in shared resources (the memory scheduler always schedules requests from that thread first, etc.)
      • A bottleneck table keeps track of different threads' bottlenecks and determines their criticality

Lecture 33 (5/2 Fri.)

  • DRAM scaling problem
  • Possible solutions to the scaling problem
    • Lower-leakage DRAM
    • Heterogeneous DRAM (TL-DRAM, etc.)
    • Add more functionality to DRAM
    • Denser design (3D stack)
    • Different technology
      • NVM
  • Non volatile memory
    • Resistive memory
      • PCM
        • Inject current to change the phase
        • Scales better than DRAM
          • Multiple bits per cell
            • Wider resistance range
        • No refresh is needed
        • Downside: Latency and write endurance
      • STT-MRAM
        • Inject current to change the polarity
      • Memristor
        • Inject current to change the structure
    • Persistence - data stays there even without power
      • Unified memory and storage management (persistent data structure) - Single level store
        • Improve energy and performance
        • Simplify programming model
buzzword.txt · Last modified: 2014/05/02 14:14 by rachata