Lecture 1

  • Hardware/software interface
  • Programmer/Hardware Designer
  • Design decisions. Example: a single high-performance core vs. multiple lower-performance cores
  • Levels of transformation
  • Instruction Set Architecture (ISA)
  • Abstraction
    • Benefits
    • Why cross layers?
  • GPUs, vector machines vs scalar machines
  • GPGPUs
  • VLIW - Very Long Instruction Word Architectures
  • Memory hierarchy
  • Multi core systems
    • Shared resources
    • Main memory
    • Memory controller
    • Unfairness at the memory controller
  • DRAM organization/operation
    • Rows
    • Columns
    • Row-Buffer
    • Row-hit, row-conflict
  • Compile time vs run time
  • Not known at compile time
    • load latency
    • branch direction
    • address of memory operations
  • Compiler profiling
  • Amdahl's definition of architecture

Lecture 2

  • Hardware/Software - this course is at the interface
  • better software/hardware design possible if you understand the other
  • MMX multimedia extensions to x86 in 1995
  • Enabling new applications
    • Virtual reality
    • Personal genomics
  • Typical 1990s ideal computer
    • Single thread
    • Pipelined
    • Out of order execution
    • Superscalar
  • Current day: multicore
  • Today's constraints
    • Energy
    • Complexity
    • Fault tolerance
    • Memory wall
  • Out of order execution at compile time vs run time
  • Bigger caches on chip?
    • tradeoff
    • IBM Z series - large last-level cache - used in banking, which earlier relied on mainframes
  • von Neumann model
  • Instruction processing cycle
    • Fetch
    • Decode
    • Fetch operands
    • Execute
    • Compute next inst address
  • Program counter/Instruction pointer
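
A minimal sketch of the instruction processing cycle above, written as a C interpreter loop for a made-up accumulator ISA (the opcodes, encoding, and memory size are all illustrative assumptions, not any real machine):

    #include <stdint.h>

    /* Made-up accumulator ISA: top 8 bits = opcode, low 8 bits = address. */
    enum { OP_LOAD = 0, OP_ADD = 1, OP_HALT = 2 };

    uint16_t mem[256];   /* unified instruction/data memory (stored program) */

    void run(void) {
        uint16_t pc = 0, acc = 0;
        for (;;) {
            uint16_t inst    = mem[pc];        /* 1. fetch                     */
            uint16_t opcode  = inst >> 8;      /* 2. decode                    */
            uint16_t addr    = inst & 0xFFu;
            uint16_t operand = mem[addr];      /* 3. fetch operands            */
            switch (opcode) {                  /* 4. execute                   */
            case OP_LOAD: acc = operand;                   break;
            case OP_ADD:  acc = (uint16_t)(acc + operand); break;
            case OP_HALT: return;
            }
            pc++;                              /* 5. compute next inst address */
        }
    }

The program counter is the only sequencing mechanism here, which is exactly what the data flow model below gives up.
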
  • Data flow model
    • Firing
    • Tag matching
  • von Neumann vs Data flow
  • ISA vs Micro architecture
  • Precise state (von Neumann) vs imprecise state (Data flow)
  • Current systems: Restricted data flow
  • Major ISAs - x86, SPARC, MIPS
  • Multiple pipelines vs multiple cores
  • Architecture - ISA + Microarchitecture
  • Backward compatibility
    • Binary Coded Decimal (BCD) example
  • Condition codes
  • Design point
  • Problem space/application space
  • SUN MAJC microprocessor - designed for JAVA programs

Lecture 3

  • REP MOVS
    • In x86, a single instruction
    • In other ISAs, several instructions
  • von Neumann
    • Stored program
    • Sequential processing
  • DEC: VAX
    • More than 300 instructions
    • Complicated
    • Index instruction, instructions that operate on doubly linked lists
  • CISC vs RISC
  • Instruction
    • Opcode
    • Operand
  • LC3
    • Fixed length instructions
    • Uniform decode
    • Bit steering
  • Serial decode
  • Data types
  • Atomicity
  • Semantic gap
  • Intel 432
  • Address space
    • Addressability
    • Big Endian/Little Endian
  • Memory hierarchy
  • Programmer visible vs Programmer invisible
  • Addressing modes
  • Indirect addressing mode → pointer operations
  • Sparse matrix accesses
  • ISA orthogonality → VAX - classic example
  • Memory mapped vs Special I/O
  • Privilege modes
  • Exception/Interrupt handling

Lecture 4

  • Orthogonality
  • Auto-increment addressing mode
  • Complex vs Simple instructions
  • Semantic gap
  • Translation: ISA → implementation ISA
  • Micro instructions
  • x86 - small semantic gap; Hardware translates x86 instructions into simpler ones
  • Transmeta - translates x86 into a secret (proprietary) VLIW ISA
  • VAX: small semantic gap example
  • IBM 801 minicomputer - very large semantic gap; John Cocke
  • Dynamic/static interface
  • Intel 432
    • Huffman encoding to encode frequently used instructions in a small number of bytes
  • Fixed vs variable length ISA
  • Uniform vs non-uniform decode
  • Uniform decode
    • Enables parallelism
    • Speculative execution
  • Non-uniform decode
    • Serial decode
  • Addressing modes
    • x86: Scale, base, index, displacement
  • Other ISA level tradeoffs
  • Hardware vs software interlocking
  • Virtual memory vs overlay programming
  • Microarchitecture
    • Multiple choices to implement the same ISA
  • Single cycle instruction processing

Lecture 5

  • Semantic gap
    • Benefits of translating in hardware vs translating in software
  • Variable length, uniform decode ISA - possible?
    • Uniform decode typically goes with fixed length
  • RISC vs CISC
  • RISC
    • Simple instructions
    • Fixed length
    • Uniform decode
    • Few addressing modes
  • CISC
    • Complex instructions
    • Variable length
    • Non-uniform decode
    • Many addressing modes
  • Uniform decode
    • Hard to get perfect uniform decode
    • MIPS - uniform within instruction types
  • Semantic mismatch
  • How about multiple levels of translation
    • Overhead vs value add
  • Basic instruction processing
  • Single cycle
    • Only combinational logic
    • Critical path determined by slowest instruction
  • Single-cycle vs Multi-cycle machines
  • Datapath
  • Single vs Multiple ALUs
  • Control signals
  • Instruction processing
  • 2 separate adders needed for address generation and ALU operation
  • R-type datapath
  • I-type datapath
  • Combining R-type and I-type datapaths
  • Load/Store datapath
  • Non-control-flow datapath
  • Branch delay slots
  • Unconditional jump datapath

Lecture 6

Single cycle implementation

  • JALR instruction phases
    • Fetch, Decode, Evaluate address, Store result
  • Datapath vs control logic
  • Semantics of a delayed branch
  • Control logic
    • Combinational logic - hardwired control
    • Sequential logic - microprogrammed control
  • 2 components to performance
    • Number of cycles
    • Cycle time
  • Critical path
    • Determined by slowest instruction to process
    • Time it takes
      • to access memory
      • to go through ALU
      • to determine the control signals
  • CDC 5600
    • Control logic could also affect the critical path
    • Control store access took too long
    • Fetched 2 control words on each access to amortize cost
  • Memory latency
    • High complexity - must replicate any resource needed more than once by an instruction

Micro architecture design principles

  • Critical path design
  • Bread and butter (common case) design
  • Balanced design

Performance analysis

  • Execution time
    • CPI
    • Clock cycle time
    • CPI and clock cycle time are at odds: lowering one tends to raise the other
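
The two combine in the "iron law" of performance:

  T_{exec} = N_{inst} \times CPI \times t_{cycle}

Illustration with made-up numbers: 10^9 instructions at CPI = 1 and a 5 ns cycle (single-cycle style) take 5 s, while CPI = 4 with a 1 ns cycle (multi-cycle style) takes 4 s.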

Microcoded implementation

  • Microinstruction
  • Microsequencing
  • Microsequencer
  • Generating control signals in the same cycle as datapath processing - bad
  • LC3-b datapath
    • MAR - Memory Address Register
    • MDR - Memory Data Register
    • Gating
    • J bits determine address of the next state

Lecture 7

Simulator design

  • Accuracy vs speed
  • Functional accuracy vs Timing accuracy
  • Determines
    • design space exploration time
    • time to market

Microprogramming

  • Details of state machine - microarchitectural choice
  • Control signals
    • some stored in the control store
    • some generated by the datapath
    • JSRR in LC-3b and JALR in LC-3b
  • Example: JSR vs JSRR in LC-3b
  • Design of microsequencer – depends on state encoding

Microcoding of an instruction

  • Unconditional vs conditional transitions between states
  • J bits determine transition between states
  • Conditional transitions: Some transitions occur on specific conditions
  • In-class exercise: Microcoding of states 33 and 18

Memory IO

  • word alignment
  • Look at MAR - Bottom bits of the memory address register
  • Write enables: depend on address
    • cannot come from the control store
  • Tradeoff between generating control signals in logic and storing them in the control store
  • Approaches to handling data dependences
    • Fine-grained multithreading - no need to detect dependences, so the hardware can be simple. Ex: Sun Niagara, Intel Larrabee
    • Predict and execute "speculatively"
    • Interlocking
  • Detect and stall vs scoreboarding
    • Scoreboarding needs to stall for other kinds of dependences
    • Detect and stall - dependency check logic is complex
  • Superscalar execution

Resource contention

  • Resource contention → also called resource dependence sometimes
  • Solutions
    • Duplicate resource Ex: separate instruction and data caches, Use multiple ports
    • Stall one of the contenders
  • Why (preferably) not a single 2-ported memory in a pipelined design?
    • Requires additional wiring across pipeline stages
  • Whom to stall when a read and a write contend?
    • The read, because letting the write finish would enable dependent instructions to execute

Lecture 10

Side note: Errors in hardware

  • Allow hardware to produce errors
  • Relaxed correctness

Review of Pipelining

  • Ideal vs non-ideal pipelines
  • Pipeline registers/control
  • Balancing work
  • Pipeline stalls/bubbles
    • Bubble: If a stage takes more than one cycle, insert a bubble/nop
    • Stall signal needs to be propagated to previous stages
  • Scoreboarding vs hardware combinational logic
  • Doing dependency checking in the compiler
    • Advantage: simpler hardware
    • Disadvantages:
      • Compiler might not find useful work, so nops get inserted. This could increase code size.
      • Can't handle branches and other variable outcome/latency events
    • Dependency
    • Dependency types
  • Does every stage have dependency checking logic?
    • In a pipeline with no forwarding, dependency checking logic in one stage is enough
  • Detect and Stall
  • Detect and Forward
  • Fine grained multi-threading (used in old supercomputers)
    • Fetch from a different thread every cycle
    • No need to detect dependency in software or hardware
  • Value prediction - Predict the value in the register
    • Check if correct
    • Correct: no action
    • Incorrect: Flush pipeline and reexecute
  • Control dependence
    • Which instruction to fetch next? depends on the current instruction
    • When do you know the next instruction is control flow?
    • Need to determine if next instruction is control flow when you fetch
    • Could stall pipeline on a branch: insert bubbles

Pipeline implementation: in general

  • Safe and unsafe movement of pipeline
  • RAW dependence analysis example
  • Stall conditions
    • Comprehensive check between source registers and destination registers
  • Impact of stall on performance
    • Each stall means 1 lost ALU cycle
    • Depends on distance between dependences

Data forwarding

  • Communicate the value produced by source instruction I_s to consumer instruction I_c
  • Register file is but a medium to do this
  • So, provide a path separate from the register file to communicate value
  • Many data forwarding paths
    • From the output of the ALU
    • From the output of memory
  • Forwarding cannot always help move the pipeline
  • Load operation followed by a dependent store - can't forward
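
A sketch of the bypass condition a forwarding unit checks, in the style of a classic five-stage MIPS pipeline rather than the LC-3b below; latch and field names are illustrative:

    /* Forwarding-unit sketch for a five-stage MIPS-style pipeline.
       rs: a source register of the instruction currently in EX.
       EX/MEM holds the youngest older instruction; MEM/WB the next oldest. */
    int forward_rs(int rs, int exmem_regwrite, int exmem_rd,
                   int memwb_regwrite, int memwb_rd) {
        /* Prefer the youngest producer: its value is the most recent write. */
        if (exmem_regwrite && exmem_rd != 0 && exmem_rd == rs)
            return 2;   /* forward from the ALU output (EX/MEM latch)  */
        if (memwb_regwrite && memwb_rd != 0 && memwb_rd == rs)
            return 1;   /* forward from the writeback value (MEM/WB)   */
        return 0;       /* no hazard: the register file value is fine  */
    }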

LC-3b pipeline implementation

  • Conservative
  • Stalls on flow dependences
  • Dependency stall and branch stall signals
  • MemPCMux determined by logic
  • 3 types of control signals
    • Datapath
    • Control store
    • Stall

Control dependence handling

  • Branch type
    • Conditional
    • Unconditional
    • Call
    • Return
    • Indirect
  • Return and indirect branches - can have many targets
  • Handling control dependences: at a high level
    • Critical to keep pipeline full
    • Predicting branches: Taken vs Not-taken
    • Around 60% of branches are taken
    • Predicting more intelligently - branch prediction
      • Example: Take the path it took last time
    • Delayed branching: branch delay slots
    • Fine-grained multithreading
  • Predicated execution: convert control dependence into data dependence
  • CMOV instruction
  • Adv: Good if misprediction cost > useless work due to predication
  • Disadv: CMOV needs three source operands (the condition plus two values)
  • Disadv: Compiler technique; so not adaptive
  • Predicated execution in Intel Itanium
  • Conditional execution in ARM ISA

Multi-path execution

  • Execute both paths after a conditional branch
  • Do confidence estimation
  • Execute both paths on a branch
  • Each path requires its own context
  • The number of paths to follow grows exponentially

Other branches

  • Returns
  • Returns have matching calls
  • Use stack to predict return address
  • Indirect branches: many targets
    • Ex: switch case statements, virtual function calls
  • Indirect branch prediction
    • No direction prediction needed
    • Solution 1: Predict last resolved target as next address
    • Solution 2: Use branch history
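
A sketch of Solution 1 above: a direct-mapped table that remembers each indirect branch's last resolved target. The table size and PC hashing are arbitrary illustrative choices:

    #include <stdint.h>

    #define TT_SIZE 1024   /* arbitrary: 1K-entry direct-mapped target table */
    static uint32_t target_table[TT_SIZE];

    /* Predict: the target this branch resolved to last time. */
    uint32_t predict_indirect(uint32_t pc) {
        return target_table[(pc >> 2) % TT_SIZE];
    }

    /* Train: remember the actual target once the branch resolves. */
    void update_indirect(uint32_t pc, uint32_t actual_target) {
        target_table[(pc >> 2) % TT_SIZE] = actual_target;
    }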

Issues in branch prediction

  • Prediction is latency critical
  • More complicated as pipeline width increases
    • Index four times?
    • Have four ports in the branch predictor

Multiple instruction fetch

  • Flynn's bottleneck
  • VLIW - compiler decides what can be executed in parallel
  • Superscalar - hardware detects dependences between instructions fetched in the same cycle
  • Compiler vs hardware
  • Multiflow
    • Had a large number of instructions fetched in a cycle
    • Compiler had to find that many independent instructions
    • Suffered from the NOP problem

Precise Exceptions

  • Multi-cycle execute
  • What if floating point multiply incurs an exception?
  • Exceptions vs Interrupts
    • Exception: internal to the process
    • Interrupt: external to the process
  • When to service
    • Exceptions - immediate
    • Interrupts - when convenient

Lecture 13

Precise Exceptions/Interrupts

  • Precise state
  • Consistent architectural state when the exception/interrupt is handled
  • Why precise exceptions?
    • To preserve sequential (von Neumann) semantics
    • Makes software debugging and recovery from exceptions easier

How to ensure precise exceptions?

  • Make all operations take same time
    • Bad idea
  • Reorder buffer (ROB)
  • Future File (FF)

Reorder buffer

  • Keeps track of oldest and youngest instruction
  • Instructions write to ROB on completion
  • ROB entry (PC, exception bit, valid bit, etc.)
  • Instructions allowed to update arch state when they become oldest
  • Exceptions become visible in program order too
  • What if later operation needs value in ROB?
  • What if you run out of ROB entries?
    • ROB's a circular buffer
  • Implementation of the ROB
  • Register renaming
    • ROB gives illusion of a larger register file
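
A sketch of the ROB as a circular buffer; the entry fields mirror the list above, while the size and helper names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 64     /* illustrative size */

    typedef struct {
        uint32_t pc;
        bool     valid;     /* entry is allocated       */
        bool     done;      /* result has been produced */
        bool     exception;
        uint64_t result;
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int head = 0, tail = 0;   /* head = oldest, tail = next free slot */

    /* Allocate in program order at dispatch; a full ROB stalls the front end. */
    bool rob_alloc(uint32_t pc, int *idx) {
        if ((tail + 1) % ROB_SIZE == head)   /* full (one slot kept unused) */
            return false;
        rob[tail] = (RobEntry){ .pc = pc, .valid = true };
        *idx = tail;
        tail = (tail + 1) % ROB_SIZE;
        return true;
    }

    /* Retire only from the head, so architectural state (and exceptions)
       become visible strictly in program order. */
    void rob_retire(void) {
        while (head != tail && rob[head].done) {
            if (rob[head].exception)
                break;   /* take the precise exception at this PC */
            /* commit rob[head].result to the architectural register file */
            rob[head].valid = false;
            head = (head + 1) % ROB_SIZE;
        }
    }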

Future File

  • ROB still ensures in-order updates to the arch register file
  • Instructions write results to both the future file and the ROB
  • Adv: In the decode stage, only the future file is accessed - fast
  • On an exception, the arch register file is copied into the future file

Out-of-order execution

  • Data dependences stall the pipeline
  • Preventing dispatch stalls
    • Move dependent instructions out of the way of independent instructions
  • Enabling OoO execution
    • Need to link producers and consumers
    • Buffer instructions until they are ready
    • Track readiness of each instruction's sources
    • Dispatch instruction when sources available
  • Register Alias Table (RAT)
  • Tomasulo's algorithm
    • Read source registers from RAT
    • Allocate RS entry
    • Place sources into RS entry
    • Rename destination register
    • Select logic to arbitrate between ready instructions at each execution unit
  • In summary
    • Renaming
    • Dynamically built dataflow
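
A sketch of the renaming and tag-broadcast steps summarized above; structure names and sizes are illustrative, not Tomasulo's original encoding:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 8    /* architectural registers (illustrative) */
    #define NUM_RS   16   /* reservation station entries            */

    typedef struct { bool valid; int tag; uint64_t value; } RatEntry;
    typedef struct {
        bool     src1_ready, src2_ready;
        int      src1_tag,   src2_tag;   /* producers being waited on */
        uint64_t src1_val,   src2_val;
    } RsEntry;

    static RatEntry rat[NUM_REGS];   /* register alias table */
    static RsEntry  rs[NUM_RS];

    /* Rename "dst <- src1 op src2" into reservation station rs_id. */
    void rename_and_dispatch(int dst, int src1, int src2, int rs_id) {
        /* Sources: take a ready value from the RAT, or the tag of the
           in-flight producer whose broadcast must be waited for. */
        rs[rs_id].src1_ready = rat[src1].valid;
        rs[rs_id].src1_tag   = rat[src1].tag;
        rs[rs_id].src1_val   = rat[src1].value;
        rs[rs_id].src2_ready = rat[src2].valid;
        rs[rs_id].src2_tag   = rat[src2].tag;
        rs[rs_id].src2_val   = rat[src2].value;
        /* Destination: later consumers now wait on this instruction's tag. */
        rat[dst].valid = false;
        rat[dst].tag   = rs_id;
    }

    /* Tag broadcast on completion: wake up waiting sources and the RAT. */
    void broadcast(int tag, uint64_t value) {
        for (int i = 0; i < NUM_RS; i++) {
            if (!rs[i].src1_ready && rs[i].src1_tag == tag) {
                rs[i].src1_ready = true;  rs[i].src1_val = value;
            }
            if (!rs[i].src2_ready && rs[i].src2_tag == tag) {
                rs[i].src2_ready = true;  rs[i].src2_val = value;
            }
        }
        for (int r = 0; r < NUM_REGS; r++)
            if (!rat[r].valid && rat[r].tag == tag) {
                rat[r].valid = true;  rat[r].value = value;
            }
    }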

Lecture 15

Out of Order Execution

  • Precise exceptions
  • Reorder buffer
  • Register Alias Table
  • Future file
  • Tag broadcast → dataflow like
  • OoO engine - builds dataflow graph of a piece of a program
  • Latency tolerance
    • a long latency operation
    • independent dependence chains
    • Degree of latency tolerance determined by instruction window/ROB size
  • Instruction window
  • Degree of independence within a single threaded program?
  • Branch prediction in an out-of-order machine
  • Combining superscalar and out-of-order
  • Distinction between superscalar and out-of-order
  • Load/Store dependencies in an out-of-order execution engine
  • Memory disambiguation problem
  • Reservation stations - distributed vs centralized?
  • Irregular parallelism

Data Flow

  • OoO machines - Restricted data flow at the ISA level
  • Conditional, relational, barrier synchronization nodes
  • Tokens
    • <instruction pointer, port, data>
  • Control flow vs Data flow
  • Different parts of a data flow machine
    • Matching area
    • Instruction fetch area
  • Data flow processing element
  • MIT tagged token dataflow architecture
  • Data flow → Asynchronous
  • Manchester data flow machine
  • OoO vs Data flow
    • OoO implements data flow under the hood
    • Preserves sequential semantics

Vector Processing

  • Exploiting regular parallelism
  • Single Instruction Multiple Data (SIMD)
  • Time Space Duality
    • Array processor
    • Vector processor
  • VLIW - Very Long Instruction Word

Lecture 16

  • SISD
  • SIMD
  • MISD
    • does not really exist, systolic array processors - a close approximation
  • MIMD
  • SIMD
  • Array vs Vector processor
  • Vector processing
    • Row major, Column major
    • Stride - distance between two elements of a vector
    • Main Advantages
      • Amortizes cost of instruction fetch, decode
    • Main Disadvantages
      • Works only if parallelism is regular
    • Vector vs Scalar operation
    • Amdahl's law
    • Sequential bottlenecks - degrade performance in a vector processor
    • Vector registers
    • Vector functional units
    • Multiporting
    • Vector Memory System
      • Vector Registers
      • Address generator
      • Base/Stride
      • Memory banks
      • Bank conflicts
    • Vector chaining: data forwarding from one vector functional unit to another
    • What if a vector is not stored in a strided fashion in memory?
      • Gather/Scatter operations
    • Masked operations
      • These are really predicated operations.
      • Density-time implementation
    • Automatic code vectorization
    • SIMD
      • Intel MMX instructions
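
For contrast with the machinery above, the kind of loop it all targets: DAXPY, the classic regular (vectorizable) kernel. One vector load/multiply/add/store sequence processes a whole vector register's worth of elements, amortizing fetch and decode:

    /* DAXPY: y = a*x + y. Unit stride, no loop-carried dependence, so every
       iteration is independent -- the regular parallelism vector machines want. */
    void daxpy(long n, double a, const double *x, double *y) {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
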
  • GPU
    • Thread warp: A set of threads that execute the same instruction
    • Warp-based SIMD vs Traditional SIMD
    • Latency hiding with thread warps
    • Fine-grained multi threading
  • SPMD
    • Branch divergence problem
  • VLIW
    • Compiler
    • Microarchitecture dependent - exposing microarchitecture to the compiler
    • No hardware dependency checking within or across VLIW instructions
    • Commercial VLIW machines
      • Multiflow TRACE
      • Transmeta Crusoe
      • TI C6000, Trimedia, STMicro
      • Intel IA-64
        • Instruction bundles
    • VLIW NOP problem
    • Intimately tied with microarchitecture
    • VLIW optimizations being applied in superscalar machines

Lecture 17

Memory

  • Latency
  • Bandwidth
  • AMD Barcelona
  • Ideal memory
    • Zero access time
    • Zero cost
    • Infinite capacity
  • Ideal memory requirements oppose each other
    • a larger memory is slower; a faster memory must be smaller
  • Two key memory technologies
    • DRAM
    • SRAM

DRAM

  • Rows
  • Columns
  • Row decoder
  • Column decoder
  • Bit lines, Word lines
  • Column address supplied after row address - to save pins
  • Refresh

Memory hierarchy

  • Combine large, slow memory with small, fast memory to get the illusion of both
  • Register file also part of memory hierarchy
  • Memory locality
    • Temporal
    • Spatial

Caching

  • Cache line/block
  • Ideal Load-to-use-latency = 1 cycle
  • Cache hierarchy
  • Manual vs automatic memory management
  • Hit rate, Miss rate
  • Hit/miss penalty
  • Pentium 4 hit/miss latency numbers
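
Hit rate and the hit/miss penalties combine into average memory access time (AMAT); the numbers here are illustrative, not the Pentium 4's:

  AMAT = t_{hit} + r_{miss} \times t_{miss\_penalty}

E.g. t_hit = 1 cycle, r_miss = 5%, t_miss_penalty = 100 cycles gives AMAT = 1 + 0.05 x 100 = 6 cycles.
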
  • Important cache decisions
    • Placement
    • Replacement
    • Separate I and D caches vs unified I/D cache
  • Tag store/Data store
  • Tag bits
  • Index bits
  • Conflict miss
  • Set associativity
  • Full associativity?
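
A sketch of how the tag and index bits above fall out of an address, assuming (arbitrarily) 32-bit addresses, 64 B blocks, and 64 sets:

    #include <stdint.h>

    /* Assumed geometry: 64 B blocks -> 6 offset bits; 64 sets -> 6 index bits. */
    #define OFFSET_BITS 6
    #define INDEX_BITS  6

    void split_address(uint32_t addr,
                       uint32_t *tag, uint32_t *index, uint32_t *offset) {
        *offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte in block */
        *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* set number    */
        *tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* rest: tag     */
    }

A lookup compares tag against the stored tags of every way in set index; two addresses with the same index but different tags conflict in a direct-mapped cache, which is what associativity relieves.
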
  • Replacement in set associative caches
    • Random
    • FIFO
    • Least recently used
    • Least costly to re-fetch
    • Optimal?
  • Implementing perfect LRU
    • Expensive
    • Other alternatives
      • Not MRU
      • Hierarchical LRU
  • Belady's optimal
  • Tag store entry
    • Valid bit
    • Tag
    • Replacement policy bits
    • Dirty bit - write back vs write through?

Lecture 18

Caches (continued)

  • Write back vs Write through
  • Cache consistency/coherence
  • Allocate/No-allocate on a write miss
  • Sectored caches
  • Separate Instruction, data caches vs unified caches
  • First level cache design
    • Design decisions much affected by cycle time
  • Second level cache design
    • Power, area, capacity considerations also matter

Cache performance

  • Cache size
  • Working set
  • Block size
  • Critical word
  • Subblocking
  • Associativity
  • Cache miss classification
    • Compulsory miss
      • Prefetching can be used to eliminate these
    • Conflict miss
      • More associativity
      • Other solutions: victim cache, hashing, software hints
    • Capacity miss
  • Non-blocking/lockup-free caches
    • Miss status handling registers/Miss buffers
  • Memory-level parallelism

High-bandwidth caches

  • Multiple accesses to cache in the same cycle
    • True multiporting
    • Virtual multiporting
      • used in Alpha 21264
    • Multiple cache copies
    • Banking/Interleaving

Memory

  • Row
  • Column
  • Row decoder
  • Column decoder
  • Some basic concepts
    • Address space
    • Addressability
    • Alignment
    • Interleaving
      • heavily interacts with your access pattern
  • Rank - consists of multiple chips

The DRAM subsystem

  • Channel
  • DIMM
  • Rank
  • Chip
  • Bank
  • Row/Column

Lecture 19

Aside

  • DRAM scaling issues
  • PCM and other alternative technologies

DRAM (continued)

  • Word line
  • Bit line
  • Sense amplifier → also called row-buffer
  • Row, Column address pair
  • Refresh
    • Interferes with normal operation
    • Data has to be kept intact in DRAM
  • Bank
  • DRAM chip
  • Rank - multiple chips
  • DIMM - one or more ranks
  • Multiple DIMMs together to get more capacity
  • Burst mode
  • Precharge
  • Activate - Row Address Strobe (RAS)
  • Read/Write - Column Address Strobe (CAS)
  • Critical word first design
  • Interleaving - attempts to minimize conflicts for concurrent accesses
  • Memory bandwidth problem
  • Address mapping
    • Row interleaving
    • Cache block interleaving
  • Interaction with virtual → physical mapping
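
A sketch of cache block interleaving: placing the bank bits just above the block offset sends consecutive cache blocks to different banks. The field widths and bit order are illustrative assumptions; real controllers pick them deliberately:

    #include <stdint.h>

    typedef struct { uint32_t row, bank, col, offset; } DramAddr;

    /* Assumed layout (high -> low bits): row | col | bank | block offset. */
    DramAddr map_address(uint64_t paddr) {
        DramAddr a;
        a.offset = paddr & 0x3F;             /* 64 B block offset */
        a.bank   = (paddr >> 6)  & 0x7;      /* 8 banks           */
        a.col    = (paddr >> 9)  & 0x7F;     /* 7 column bits     */
        a.row    = (uint32_t)(paddr >> 16);  /* remaining: row    */
        return a;
    }

Row interleaving instead keeps all column bits contiguous below the row bits, so consecutive blocks stay within one row of one bank.
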
  • DRAM refresh
    • Burst refresh
    • Distributed refresh
    • Different cells - need to be refreshed at different rates
    • Weak and strong cells due to variability in DRAM manufacturing process
  • Aside: Memory errors (bit flips) are often the cause of blue screens
  • Bloom filter
  • Volatile vs Non-volatile memory
  • DRAM controller

Aside: Power Reduction Techniques

  • Dynamic power, Static power (first-order model after this list)
  • Static power - depends on leakage
  • Dynamic voltage and Frequency Scaling (DVFS)
  • Power conservation principles
    • Reduce frequency
    • Turn off
  • Clock gating
  • Sleep transistors
  • Power impacts performance - Thermal Design Power
  • Turn-off/turn-on latency
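
The first-order power model behind the list above (a standard textbook approximation, not from the lecture itself):

  P_{dynamic} \approx \alpha \, C \, V^2 \, f   (activity factor \alpha, switched capacitance C)
  P_{static}  \approx V \cdot I_{leakage}

DVFS exploits the V^2 f product: lowering f permits lowering V, so dynamic power falls roughly cubically with frequency.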

Back to DRAM

  • Turn off DRAM chips that have no data mapped to them
  • Where to place DRAM controller?
    • Chipset (off-chip)
    • CPU chip (on-chip)
  • DRAM controller
    • Request buffer
    • Arbiter
    • Bank scheduler, Bus scheduler
    • Scheduling policies
      • FCFS
      • FR-FCFS - maximizes DRAM throughput
    • Row-buffer management
      • Open row
      • Closed row
    • DRAM timing constraints - over 50 timing constraints
    • DRAM power management
      • Active
      • All banks idle
      • Self-refresh
      • Power down

Lecture 20

Bloom filters (revisited)

  • False positive
  • False negative
  • Aliasing/Collision
  • Hash function
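
A minimal Bloom filter sketch tying these bullets together; the size and the two hash functions are arbitrary illustrations:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define FILTER_BITS 1024   /* illustrative size */
    static uint8_t filter[FILTER_BITS / 8];

    /* Two cheap, made-up hash functions; real designs choose these carefully. */
    static unsigned h1(uint32_t x) { return (x * 2654435761u) % FILTER_BITS; }
    static unsigned h2(uint32_t x) { return ((x ^ (x >> 16)) * 40503u) % FILTER_BITS; }

    static void set_bit(unsigned i) { filter[i / 8] |= (uint8_t)(1u << (i % 8)); }
    static bool get_bit(unsigned i) { return filter[i / 8] & (1u << (i % 8)); }

    void bloom_insert(uint32_t key) { set_bit(h1(key)); set_bit(h2(key)); }

    /* "Maybe present": hash aliasing can yield false positives, but an
       inserted element is never reported absent (no false negatives). */
    bool bloom_query(uint32_t key) { return get_bit(h1(key)) && get_bit(h2(key)); }

    void bloom_clear(void) { memset(filter, 0, sizeof filter); }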

Memory scheduling (in multicore systems)

  • FR-FCFS - maximizes DRAM throughput
  • Memory performance hogs
  • Streaming, Random applications
  • Memory request buffer
  • Unfairness, low system performance
  • Denial of service
  • Inter-thread interference
  • Resource (bank, bus, row-buffer) conflicts
  • Quality of Service (QoS)
  • Stall time fair memory scheduling (STFM)
    • Estimate slowdown (using stall time)
    • Prioritize slowed down threads
    • Unfairness = Max Slowdown/Min Slowdown
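
STFM's metrics written out, where T^shared and T^alone are a thread's memory stall times when running with others vs. alone:

  Slowdown_i = T_i^{shared} / T_i^{alone}
  Unfairness = \max_i Slowdown_i \,/\, \min_i Slowdown_i

When unfairness exceeds a threshold, the most slowed-down thread is prioritized.
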
  • Parallelism-aware batch scheduling (PAR-BS)
    • Preserving Memory Level Parallelism (MLP)
      • Bank level parallelism
      • Parallelism-awareness
      • Could cause starvation
    • Request batching
      • Grouping oldest requests
      • Avoids starvation
      • Principle applied in disk controllers in the 1960s
    • Within-batch scheduling
      • Shortest stall-time first ranking
    • Batch sizes - marking cap
    • Thread-aware memory schedulers improve performance and fairness
  • Memory request scheduling - the major memory interference reduction approach we've seen
  • Other ways of interference reduction (and in fact, shared resource contention)
    • Data mapping (Channel partitioning)
    • Source throttling
  • Handling interference in parallel applications
    • Barrier synchronization
    • Some threads on critical path of execution due to synchronization
  • Wrap up
    • Inter-thread interference is also present in other shared resources
      • Memory controller
      • Interconnect
      • Caches

Virtual Memory

  • Illusion of infinite capacity
  • Makes programmer's life easy
    • Gives programmer the illusion of a large address space
    • Limited by the virtual address space
  • ISA has an address space greater than physical memory size
  • Basic mechanism
    • Indirection
    • Virtual address space → divided into pages
    • Physical address space → divided into frames
  • Virtual page number
  • Physical frame number
  • Page table
    • is per process
  • Page fault
  • Precise exceptions
  • Exception handler
  • Address translation
  • Page size
    • x86 supports multiple page sizes, ranging from 4KB up to 1GB
  • Page table entry
    • Valid bit
    • Dirty bit
    • Reference or Access bit
      • Used for page replacement decisions
    • Page frame number
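
The translation arithmetic in its simplest one-level form, assuming 4 KB pages (12 offset bits); names and the PTE layout are illustrative:

    #include <stdint.h>

    #define PAGE_SHIFT 12   /* 4 KB pages */

    typedef struct { uint64_t pfn; int valid, dirty, referenced; } Pte;

    /* One-level translation: index the per-process page table with the
       virtual page number; a clear valid bit would raise a page fault. */
    uint64_t translate(const Pte *page_table, uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & ((1ull << PAGE_SHIFT) - 1);
        const Pte *pte  = &page_table[vpn];
        /* if (!pte->valid) page_fault(vaddr);   -- precise exception */
        return (pte->pfn << PAGE_SHIFT) | offset;
    }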

Lecture 21

Virtual Memory

  • 2 major functions of virtual memory
    • Address translation
    • Protection
  • Memory management unit
  • Access control
    • Types of access - Read, Write, Execute
    • Privilege level - Kernel, Executive, Supervisor, User
  • Privilege levels in x86
  • Virtual memory not always needed for access control
  • Other ways of doing access control
    • Base and bound registers
    • Segmented address space
  • Page table
    • Can be huge
  • Multi-level page tables
  • Multi-level page tables in x86
    • Page directory (first level)
    • Page table (second level)
  • Page table base register (or) Page directory base register part of a process's context
    • called CR3 in x86
  • x86 page table and page directory examples
  • x86 four-level paging
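
A sketch of a two-level walk in the style of classic 32-bit x86 paging (10 directory bits + 10 table bits + 12 offset bits); the entry layout here is simplified to a valid bit plus frame number:

    #include <stdint.h>

    /* Real PDEs/PTEs also carry permission, accessed, and dirty bits. */
    typedef struct { uint32_t valid : 1, frame : 20; } Entry;

    /* pmem models physical memory as 4 KB frames of entries so frame
       numbers can be followed directly -- an illustrative shortcut. */
    uint32_t walk(Entry pmem[][1024], uint32_t pdbr_frame, uint32_t vaddr) {
        uint32_t dir_idx = (vaddr >> 22) & 0x3FF;   /* top 10 bits    */
        uint32_t tbl_idx = (vaddr >> 12) & 0x3FF;   /* next 10 bits   */
        uint32_t offset  =  vaddr        & 0xFFF;   /* 12 offset bits */

        Entry pde = pmem[pdbr_frame][dir_idx];      /* level 1: page directory */
        /* if (!pde.valid) -> page fault */
        Entry pte = pmem[pde.frame][tbl_idx];       /* level 2: page table     */
        /* if (!pte.valid) -> page fault */
        return ((uint32_t)pte.frame << 12) | offset;
    }

pdbr_frame plays the role of the CR3-like page directory base register from the process's context.
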
  • Making address translation faster
    • Translation Lookaside Buffer (TLB) - a cache of page table entries
    • Accessed with the virtual page number
    • TLB hit
    • TLB miss
  • On a TLB miss
    • Page table walk
  • Hardware managed vs software managed TLB miss
  • Cache-Virtual Memory interaction
    • Physical cache
    • Virtual cache
    • Virtual-physical cache
  • Ideally
    • Fast access, as in a virtual cache
    • No synonym problem, as in a physical cache
    • Solution: Virtual-physical cache
  • Virtually indexed physically tagged cache
    • synonym problem
    • homonym problem
  • Ways to handle the synonym problem
    • Page coloring
    • In hardware,
      • On a write, check what other physical locations the address could be stored at
      • Update/Invalidate if the cache block exists there
  • So far, virtual memory - cache interaction
  • Virtual memory - DRAM interaction also possible
    • Ex: Page coloring to minimize bank conflicts

Lecture 22

Latency tolerance

  • Instruction window
    • How large does it have to be to continue decoding?
  • Full-window stall
  • Large number of stalls due to L2 misses
  • Tolerating stalls
    • Caching
    • Prefetching
    • Multithreading
    • Out-of-order execution
  • Write buffer
    • Writes are off the critical path
  • Caching
    • Passive technique, always incurs compulsory misses
  • Prefetching
    • initially used in the IBM 360/91
  • Multithreading
    • Coarse grained multithreading
      • Intel Itanium 2
    • Fine grained multithreading
      • Sun Niagara
      • Intel Larrabee
    • Switch on event multithreading
    • Simultaneous multithreading
    • Helper threads

Prefetching

  • Software/Hardware
  • Execution based prefetching
  • Can eliminate compulsory misses which caching can't
  • Works if predictable pattern of misses
  • Prefetching misprediction
    • No need to recover
  • Prefetcher placement in the hierarchy
  • What to prefetch?
    • Accurate prediction is important
    • Wasteful prefetches waste resources
    • Prediction based on past access patterns
    • Use compiler knowledge
    • Software prefetches - prefetch instructions
  • When to prefetch?
    • Timeliness
    • More timely by being more aggressive
    • Demand access stream
  • Where to prefetch into?
    • Which level of cache?
    • Skew cache replacement policy to favor demand-fetched blocks
  • How?
    • Software
    • Hardware
      • Ex: A stride prefetcher requires hardware to detect strides and prefetch
    • Execution-based
      • Executing a piece of the program to generate prefetches
  • Software prefetching
    • x86 prefetch instructions
      • Microarchitecture dependent
    • Prefetch distance
      • Depends on hardware implementation - portability issues
    • Compiler inserted prefetches
      • Profiling
  • Hardware prefetching
    • Next-line prefetcher
    • Stride prefetcher (sketch after this list)
      • Instruction based stride prefetching
      • Cache block address based stride prefetching
    • Stream buffer
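
A sketch of the stride detection referenced above; the per-stream entry, the confirmation threshold, and a prefetch degree of one are all arbitrary illustrative choices:

    #include <stdint.h>

    /* One tracking entry of a stride prefetcher (a real table is indexed
       by load PC or memory region). */
    typedef struct {
        uint64_t last_addr;
        int64_t  stride;
        int      confidence;
    } StrideEntry;

    /* Called on each demand access; returns a prefetch address, or 0 if
       no stable stride has been confirmed yet. */
    uint64_t on_access(StrideEntry *e, uint64_t block_addr) {
        int64_t stride = (int64_t)(block_addr - e->last_addr);
        if (stride != 0 && stride == e->stride) {
            if (e->confidence < 2) e->confidence++;   /* confirm the stride */
        } else {
            e->stride = stride;                       /* retrain            */
            e->confidence = 0;
        }
        e->last_addr = block_addr;
        return (e->confidence >= 2) ? block_addr + (uint64_t)e->stride : 0;
    }
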
  • Prefetcher performance
    • Accuracy
    • Coverage
    • Timeliness
    • Bandwidth consumption
    • Cache pollution

Lecture 23

Aside - Benchmarking

  • SPEC CPU benchmarks
    • Floating point
    • Integer
  • SPECweb
  • SPECjbb
  • Special purpose vs general purpose processors
  • Field Programmable Gate Arrays (FPGA) - Special purpose design
    • Provides configurability at the logic level
  • Application Specific Integrated Circuits (ASIC) - even more special purpose than FPGAs

Back to prefetching

  • Co-operative hardware/software prefetchers
  • Irregular access patterns
    • Indirect array accesses
    • Random patterns
    • Linked data structures
  • Prefetchers for irregular access patterns
    • Correlation based prefetching
      • Markov prefetching
    • Content directed prefetching
      • Prefetches pointers
      • How do you identify a pointer?
  • Aside: How about prefetching across page boundaries?
  • Execution-based prefetchers
    • Pre-execute a pruned version of the program to prefetch
      • Where to execute?
      • When to spawn?
      • When to terminate?
    • Pre-execution thread can be in hardware or software
    • Pre-execution on a simultaneous multithreading processor
    • Slice construction

Runahead Execution

  • Ideally, very long instruction window
    • Expensive
    • Runahead - one possible way to get the benefits of a large window
  • Memory-level parallelism
  • L2-miss dependent instructions - Invalid
  • Invalid values not used for prefetching/branch resolution
  • Implemented in IBM Power6, SUN “Rock”

Lecture 24

Runahead execution (Continued)

  • Skip list
  • Runahead execution - low cost, bandwidth efficient
  • Runahead vs Increasing size of out-of-order execution window
  • Compute-bound workloads - not much improvement with runahead/prefetching
  • Runahead in Sun ROCK
  • Inefficiency in runahead
  • Dependent cache misses
    • How to parallelize?
      • Predict the value.

Multiprocessing

  • Parallelism - Doing multiple things at the same time
  • One processor at frequency F vs four processors at frequency F/4 (assuming perfectly parallelizable work)
    • Four processors at frequency F/4 consume less power
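
The arithmetic behind this claim, assuming dynamic power scales as V^2 f and that voltage can be scaled down proportionally with frequency (so P \propto f^3):

  P(\text{1 core at } F) \propto F^3
  P(\text{4 cores at } F/4) \propto 4 \times (F/4)^3 = F^3 / 16

Same nominal throughput on perfectly parallel work, at roughly one sixteenth the dynamic power.
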
  • Cost-efficiency, scalability and dependability
  • Multiple threads vs multiple processes
  • MIMD processing
    • Loosely coupled multiprocessors
      • Message passing
      • Remote procedure calls
    • Tightly coupled multiprocessors
      • Shared global memory address space
      • Synchronization of accesses to shared data
  • Tightly coupled multiprocessors (in more detail)
    • Cache consistency
    • Resource sharing/contention
    • Load imbalance
  • Aside: Hardware based multithreading
    • Coarse-grained
    • Fine-grained
    • Simultaneous multithreading
  • Comparing a single processor system with a multiprocessor system
    • Speedup in a multiprocessor system
    • Horner's method
    • Superlinear speedup
      • Working set effects
      • Cache effects
    • Utilization
    • Redundancy
    • Efficiency
  • Caveats of parallelism
    • Why diminishing returns?
    • Amdahl's law (formula after this list)
    • Serial bottleneck
    • Parallel portion not perfectly parallel
      • Synchronization overhead
      • Load imbalance
      • Resource sharing
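
Amdahl's law from the list above, with parallelizable fraction p and N processors:

  Speedup(N) = \frac{1}{(1 - p) + p/N} \le \frac{1}{1 - p}

Even with p = 0.9 the speedup saturates at 10x as N grows; synchronization, load imbalance, and resource sharing shrink the effective p further.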

Lecture 25

  • Sequential bottleneck
  • Example with a task queue
  • Bottlenecks in parallel portion
  • Difficulty in parallel programming
    • “Embarrassingly parallel applications”
    • Difficult to parallelize applications

Memory Ordering

  • Memory ordering in parallel programs
    • Preserving an expected ordering
  • In a single program
    • Sequential order specified by the von Neumann model
    • Precise architectural state
    • Consistent across two runs
  • In a data flow program
    • No sequential ordering
  • In a MIMD processor
  • Why does memory ordering matter?
    • Ease of debugging
    • Correctness
    • Performance and overhead
  • Correctness violation example
    • 2 processors do not see the same ordering from their respective viewpoints
    • Need for a global order - sequential consistency
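
The classic two-flag (Dekker-style) sketch of such a violation; plain C for illustration, ignoring that real code would need atomics:

    /* Both flags start at 0; thread 0 runs t0(), thread 1 runs t1(). */
    int flag0 = 0, flag1 = 0;

    void t0(void) {
        flag0 = 1;             /* store */
        if (flag1 == 0) {      /* load  */
            /* critical section */
        }
    }

    void t1(void) {
        flag1 = 1;             /* store */
        if (flag0 == 0) {      /* load  */
            /* critical section */
        }
    }

Under sequential consistency both loads cannot return 0: in any global interleaving at least one store precedes the other thread's load. If each core lets its (locally independent) load bypass its store, both threads can enter the critical section at once.
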
  • Two issues with sequential consistency
    • Limits performance enhancement
    • Too strong
  • Relaxed memory models
    • Enforce a global order only at synchronization boundaries

Cache coherence

  • Shared memory model
  • Whose responsibility?
    • hardware
    • software
      • Ex: IBM Cell
  • A simple cache coherence scheme
    • Valid/Invalid
    • Snooping the bus
    • Update vs Invalidate
  • Non-solutions to coherence
    • No hardware coherence
    • Share all caches between all processors
  • Update vs Invalidate tradeoffs
    • Depends on write frequency and sharing behavior
    • Ping-ponging
  • Two coherence methods
    • Snoopy bus
    • Directory
  • Directory based coherence
    • A central directory
      • Keeps track of the sharing state of each cache block across all processors
    • Read exclusive
    • More scalable
    • Directory could be distributed
  • Snoopy cache coherence
    • Easier to implement with a common bus
  • The MSI protocol
    • Modified
    • Shared
    • Invalid
  • MESI (or Illinois) protocol
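
A sketch of the MSI transitions for one cache line as seen by a single snooping cache; function and event names are illustrative:

    typedef enum { INVALID, SHARED, MODIFIED } MsiState;

    /* "Processor" events come from this core; "bus" events are snooped. */
    MsiState on_processor_read(MsiState s) {
        if (s == INVALID) { /* issue BusRd and fetch the line */ return SHARED; }
        return s;                          /* SHARED/MODIFIED: read hit */
    }

    MsiState on_processor_write(MsiState s) {
        if (s != MODIFIED) { /* issue BusRdX: invalidate other copies */ }
        return MODIFIED;
    }

    MsiState on_bus_read(MsiState s) {     /* another core reads the line  */
        if (s == MODIFIED) { /* write back / supply dirty data */ return SHARED; }
        return s;
    }

    MsiState on_bus_write(MsiState s) {    /* another core wants ownership */
        if (s == MODIFIED) { /* write back the dirty data */ }
        return INVALID;
    }

MESI (next lecture) adds an Exclusive state so that a private read followed by a write avoids the invalidation traffic.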

Lecture 26

Cache coherence (continued)

  • The MESI protocol
    • Modified
    • Exclusive
    • Shared
    • Invalid
  • State diagram for Pentium Pro
  • Snoopy invalidation tradeoffs
    • Downgrade from Modified to Shared or Invalid
    • Cache-to-cache transfer
  • Problem with MESI
    • Shared state - data has to be clean
    • Unnecessary memory updates
  • Solutions to this problem
    • Invalidate block instead of M→S
    • MOESI protocol
      • one cache designated as owner for a block
  • Pipeline parallelism
  • More states and optimizations
    • Ensuring correctness is harder
  • Snoopy cache vs Directory coherence
  • Directory
    • Key operation: set inclusion test
    • False positives are ok
    • Compressed representations, Bloom filters, or linked lists can be used
  • Contention resolution
  • Ping ponging

Interconnect

  • Determines scalability of the system
  • Main properties
    • Topology
    • Routing
    • Buffering and flow control
  • Topology
    • Bus
    • Point-to-point
    • Mesh
    • Buffered crossbar
      • Sun UltraSPARC T2 core-to-cache crossbar
  • Flow control
    • Back pressure
  • Multistage networks
    • Delta network
    • Omega network
  • Circuit switching
  • Packet switching
  • Ring topology
    • Implemented on several modern multicore processors
    • IBM Cell, Intel Larrabee, Intel Sandy Bridge
  • Bisection bandwidth
  • Mesh topology
  • Torus topology
  • Trees
  • Hypercubes
  • Caltech Cosmic Cube
    • 64-node message passing machine
  • Handling contention

Lecture 27

Interconnection networks

  • Handling contention
    • Buffer
    • Drop
    • Misroute/Deflect
  • X-Y routing/Dimension order routing
  • Flow control
  • Hot potato routing
  • Forward progress/Live lock
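
A sketch of the X-Y (dimension-order) routing decision above:

    /* X-Y routing on a 2D mesh: fully correct the X coordinate first,
       then Y. Deterministic, and deadlock-free on a mesh. */
    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } Port;

    Port route_xy(int cur_x, int cur_y, int dst_x, int dst_y) {
        if (cur_x < dst_x) return EAST;
        if (cur_x > dst_x) return WEST;
        if (cur_y < dst_y) return NORTH;
        if (cur_y > dst_y) return SOUTH;
        return LOCAL;   /* arrived: eject the packet to the local node */
    }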

Multicore design

  • Multiple threads on a single core vs multiple cores
  • Amdahl's law
  • Serial portion vs parallel portion demands
  • Large vs small cores
  • Example of a large core - IBM POWER4
  • Example of a small core - Sun Niagara
  • Tile-large approach
  • Tile-small approach
  • Best of both worlds - Heterogeneous multicore
    • Asymmetric multicore
  • CRAY-1
    • had scalar and vector units
    • Example of an early heterogeneous machine
  • Accelerating parallel bottlenecks
  • Accelerating critical sections
    • Programmer vs architect tradeoff
  • Contention for critical sections
  • Accelerating critical sections
    • Ship critical section to large core
  • Heterogeneous substrates - at a higher level
    • Do not use a one size fits all approach
    • CPUs, GPUs, Reconfigurable logic on the same chip
    • Special purpose vs general purpose