Table of Contents
Lecture 1
- Hardware/software interface
- Programmer/Hardware Designer
- Design decisions: Example: high performance single core vs low performance multi cores
- Levels of transformation
- Instruction Set Architecture (ISA)
- Abstraction
- Benefits
- Why cross layers?
- GPUs, vector machines vs scalar machines
- GPGPUs
- VLIW - Very Long Instruction Word Architectures
- Memory hierarchy
- Multi core systems
- Shared resources
- Main memory
- Memory controller
- Unfairness at the memory controller
- DRAM organization/operation
- Rows
- Columns
- Row-Buffer
- Row-hit, row-conflict
- Compile time vs run time
- not known at compile time
- load latency
- branch direction
- address of memory operations
- Compiler profiling
- Amdahl's definition of architecture
Lecture 2
- Hardware/Software - this course is at the interface
- better software/hardware design possible if you understand the other
- mmx multimedia extensions to x86 in 1995
- Enabling new applications
- Virtual reality
- Personal genomics
- Typical 1990s ideal computer
- Single thread
- Pipelined
- Out of order execution
- Superscalar
- Current day: multicore
- Today's constraints
- Energy
- Complexity
- Fault tolerance
- Memory wall
- Out of order execution at compile time vs run time
- Bigger caches on chip?
- tradeoff
- IBM Z series - large last level cache - used in banking that earlier used mainframes
- von Neumann model
- Instruction processing cycle
- Fetch
- Decode
- Fetch operands
- Execute
- Compute next inst address
- Program counter/Instruction pointer
- Data flow model
- Firing
- Tag matching
- von Neumann vs Data flow
- ISA vs Micro architecture
- Precise state (von Neumann) vs imprecise state (Data flow)
- Current systems: Restricted data flow
- Major ISAs - x86, SPARC, MIPS
- Multiple pipelines vs multiple cores
- Architecture - ISA + Microarchitecture
- Backward compatibility
- Binary Coded Decimal (BCD)example
- Condition codes
- Design point
- Problem space/application spac
- SUN MAJC microprocessor - designed for JAVA programs
Lecture 3
- REP MOVS
- In x86, a single instruction
- In other ISAs, several instructions
- von Neumann
- Stored program
- Sequential processing
- DEC: VAX
- More than 300 instructions
- Complicated
- Index instruction, instructions that operate on doubly linked lists
- CISC vs RISC
- Instruction
- Opcode
- Operand
- LC3
- Fixed length instructions
- Uniform decode
- Bit steering
- Serial decode
- Data types
- Atomicity
- Semantic gap
- Intel 432
- Address space
- Addressability
- Big Endian/Little Endian
- Memory hierarchy
- Programmer visible vs Programmer invisible
- Addressing modes
- Indirect addressing mode → pointer operations
- Sparse matrix accesses
- ISA orthogonality → VAX - classic example
- Memory mapped vs Special I/O
- Privilege modes
- Exception/Interrupt handling
Lecture 4
- Orthogonality
- Auto-increment addressing mode
- Complex vs Simple instructions
- Semantic gap
- Translation: Implementation ISA
- Micro instructions
- x86 - small semantic gap; Hardware translates x86 instructions into simpler ones
- Transmeta - tanslates x86 into secret VLIW
- Vax: Small semantic gap example
- IBM 801 mini computer - very large semantic gap; John Cock
- Dynamic/static interface
- Intel 432
- Huffman encoding to code frequently used instructions with small number of bytesFixed vs variable length ISA
- Uniform vs non-uniform decode
- Uniform decode
- Enables parallelism
- Speculative execution
- Non-uniform decode
- Serial decode
- Addressing modes
- x86: Scale, base, index, displacement
- Other ISA level tradeoffs
- Hardware vs software interlocking
- Virtual memory vs overlay programming
- Microarchitecture
- Multiple choices to implement the same ISA
- Single cycle instruction processing
Lecture 5
- Semantic gap
- Benefits of translating in hardware vs translating in software
- Variable length, uniform decode ISA - possible?
- Uniform decode typically goes with fixed length
- RISC vs CISC
- RISC
- Simple instructions
- Fixed length
- Uniform decode
- Few addressing modes
- CISC
- Complex instructions
- Variable length
- Non-uniform decode
- Many addressing modes
- Uniform decode
- Hard to get perfect uniform decode
- MIPS - uniform within instruction types
- Semantic mismatch
- How about multiple levels of translation
- Overhead vs value add
- Basic instruction processing
- Single cycle
- Only combinational logic
- Critical path determined by slowest instruction
- Single-cycle vs Multi-cycle machines
- Datapath
- Single vs Multiple ALUs
- Control signals
- Instruction processing
- 2 separate adders needed for address generation and ALU operation
- R-type datapath
- I-type datapath
- Combining R-type and I-type datapaths
- Load/Store datapath
- Non-control-flow datapath
- Branch delay slots
- Unconditional jump datapath
Lecture 6
Single cycle implementation
- JALR instruction phases
- Fetch, Decode, Evaluate address, Store result
- Datapath vs control logic
- Semantics of a delayed branch
- Control logic
- Combinational logic - hardwired control
- Sequential logic - microprogrammed control
- 2 components to performance
- Number of cycles
- Cycle time
- Critical path
- Determined by slowest instruction to process
- Time it takes
- to access memory
- to go through ALU
- to determine the control signals
- CDC 5600
- Control logic could also affect the critical path
- Control store access took too long
- Fetched 2 control words on each access to amortize cost
- Memory latency
- High complexity - replicate resources needed more than once by an instuction
Micro architecture design principles
- Critical path design
- Bread and butter (common case) design
- Balanced design
Performance analysis
- Execution time
- CPI
- Clock cycle time
- For single instruction, CPI and clock cycle time are at odds
Microcoded implementation
- Microinstruction
- Microsequencing
- Microsequencer
- Generating control signals in same cycle as datapath processing - bad
- LC3-b datapath
- MAR - Memory Address Register
- MDR - Memory Data Register
- Gating
- J bits determine address of the next state
Lecture 7
Simulator design
- Accuracy vs speed
- Functional accuracy vs Timing accuracy
- Determines
- design space exploration time
- time to market
Microprogramming
- Details of state machine - microarchitectural choice
- Control signals
- some stored in the control store
- some generated by the datapath
- JSRR in LC-3b and JALR in LC-3b
- Example: JSR vs JSRR in LC-3b
- Design of microsequencer – depends on state encoding
Microcoding of an instruction
- Unconditional vs conditional transitions between states
- J bits determine transition between states
- Conditional transitions: Some transitions occur on specific conditions
- In-class exercise: Microcoding of states 33 and 18
Memory IO
- word alignment
- Look at MAR - Bottom bits of the memory address register
- Write enables: depend on address
- cannot come from the control store
- Tradeoff between generating control signals usirdware can be simple. No need to detect
- Fine-grained Multi threading. Ex: Sun Niagara, Intel Larrabee
- Predict and execute “speculatively”
- Interlocking
- Detect and stall vs scoreboarding
- Score boarding needs to stall for other kinds of dependences
- Detect and stall - dependency check logic is complex
- Superscalar execution
Resource contention
- Resource contention → also called resource dependence sometimes
- Solutions
- Duplicate resource Ex: separate instruction and data caches, Use multiple ports
- Stall one of the contenders
- Why (preferably) not a single 2-ported memory in a pipelined design?
- Increases additional wiring across different stages
- Whom to stall when a read and a write contend?
- The read, because letting the write finish would enable dependent instructions to execute
Lecture 10
Side note: Errors in hardware
- Allow hardware to produce errors
- Relaxed correctness
Review of Pipelining
- Ideal vs non-ideal pipelines
- Pipeline registers/control
- Balancing work
- Pipeline stalls/bubbles
- Bubble: If a stage takes more than one cycle, insert a bubble/nop
- Stall signal needs to be propagated to previous stages
- Score boarding vs hardware combinational logic
- Doing dependency checking in the compiler
- Advantage: simpler hardware
- Disadvantages:
- Compiler might not find useful work, so nops get inserted. This could increase code size.
- Can't handle branches and other variable outcome/latency events
- Dependency
- Dependency types
- Does every stage have dependency checking logic?
- In a pipeline with no forwarding, dependency checking logic in one stage is enough
- Detect and Stall
- Detect and Forward
- Fine grained multi-threading (used in old super computers)
- Fetch from a different thread every cycle
- No need to detect dependency in software or hardware
- Value prediction - Predict the value in the register
- Check if correct
- Correct: no action
- Incorrect: Flush pipeline and reexecute
- Control dependence
- Which instruction to fetch next? depends on the current instruction
- When do you know the next instruction is control flow?
- Need to determine if next instruction is control flow when you fetch
- Could stall pipeline on a branch: insert bubbles
Pipeline implementation: in general
- Safe and unsafe movement of pipeline
- RAW dependence analysis example
- Stall conditions
- Comprehensive check between source registers and destination registers
- Impact of stall on performance
- Each stall means 1 lost ALU cycle
- Depends on distance between dependences
Data forwarding
- Communicate value produced by source instruction Is to consumer instruction Ic
- Register file is but a medium to do this
- So, provide a path separate from the register file to communicate value
- Many data forwarding paths
- From the output of the ALU
- From the output of memory
- Forwarding cannot always help move the pipeline
- Load operation followed by a dependent store - can't forward
LC-3b pipeline implementation
- Conservative
- Stalls on flow dependences
- Dependency stall and branch stall signals
- MemPCMux determined by logic
- 3 types of control signals
- Datapath
- Control store
- Stall
Control dependence handling
- Branch type
- Conditional
- Unconditional
- Call
- Return
- Indirect
- Return and indirect branches - can have many targets
- Handling control dependences: at a high level
- Critical to keep pipeline full
- Predicting branches: Taken vs Not-taken
- Around 60% branches are always taken
- Predicting more intelligently - branch prediction
- Example: Take the path it took last time
- Delayed branching: branch delay slots
- Fine grained multithrrmance this way
- CMOV instruction
- Adv: Good if misprediction cost > useless work due to predication
- Disadv: Three sources
- Disadv: Compiler technique; so not adaptive
- Predicated execution in Intel Itanium
- Conditional execution in ARM ISA
Multi-path execution
- Execute both paths after a conditional branch
- Do confidence estimation
- Execute both paths on a branch
- Each path requires its own context
- Paths followed become exponential
Other branches
- Returns
- Returns have matching calls
- Use stack to predict return address
- Indirect branches: many targets
- Ex: switch case statements, virtual function calls
- Indirect branch prediction
- No direction prediction needed
- Solution 1: Predict last resolved target as next address
- Solution 2: Use branch history
Issues in branch prediction
- Prediction is latency critical
- More complicated a pipeline width increases
- Index four times?
- Have four ports in the branch predictor
Multiple instruction fetch
- Flynn's bottleneck
- VLIW - compiler decides what can be executed in parallel
- Superscalar - hardware detects dependences between instructions fetched in the same cycle
- Compiler vs hardware
- Metaflow
- Had a large number of instructions fetched in a cycle
- Compiler had to find that many independent instructions
- Suffered from the NOP problem
Precise Exceptions
- Multi-cycle execute
- What if floating point multiply incurs an exception?
- Exceptions vs Interrupts
- Exception: internal to the process
- Interuupt: external to the process
- When to service
- Exceptions - immediate
- Interrupts - when convenient
Lecture 13
Precice Exceptions/Interrupts
- Precise state
- Consistent architectural state when exception/interrupt executed
- Why precise exceptions?
- To preserve sequential (von Neumann) semantics
- Makes software debugging and recovery from exceptions easier
How to ensure precise exceptions?
- Make all operations take same time
- Bad idea
- Reorder buffer (ROB)
- Future File (FF)
Reorder buffer
- Keeps track of oldest and youngest instruction
- Instructions write to ROB on completion
- ROB entry (PC, exception bit, valid bit etc.,)
- Instructions allowed to updated arch state when they become oldest
- Exceptions become visible in program order too
- What if later operation needs value in ROB?
- What if you run out of ROB entries?
- ROB's a circular buffer
- Implementation of the ROB
- Register renaming
- ROB gives illusion of a larger register file
Future File
- ROB to ensure updates to arch register file
- Instructions write to Future file and ROB.
- Adv: In decode stage, access only future file. So fast
- On exception, arch reg file copied to future file
Out-of-order execution
- Data dependences stall the pipeline
- Preventing dispatch stalls
- Move dependent instructions out of the way of independent instructions
- Enabling OoO execution
- Need to link producers and consumers
- Buffer instructions until they are ready
- Track readiness of each instruction's sources
- Dispatch instruction when sources available
- Register Alias table
- Tomasulo's algorithm
- Read source registers from RAT
- Allocate RS entry
- Place sources into RS entry
- Rename destination register
- Select logic to arbitrate between ready instructions at each execution unit
- In summary
- Renaming
- Dynamically built dataflow
Lecture 15
Out of Order Execution
- Precise expections
- Reorder buffer
- Register Alias Table
- Future file
- Tag broadcast → dataflow like
- OoO engine - builds dataflow graph of a piece of a program
- Latency tolerance
- a long latency operation
- independent dependence chains
- Degree of latency tolerance determined by instruction window/ROB size
- Instruction window
- Degree of independence within a single threaded program?
- Branch prediction in an out-of-order machine
- Combining super scalar and out-of-order
- Distinction between super scalar and out-of-order
- Load/Store dependencies in an out-of-order execution engine
- Memory disambiguation problem
- Reservation stations - distrubuted vs centralized?
- Irregular parallelism
Data Flow
- OoO machines - Restricted data flow at the ISA level
- Conditional, relational, barrier synchronization nodes
- Tokens
- <instruction pointer, port, data>
- Control flow vs Data flow
- Different parts of a data flow machine
- Matching area
- Instruction fetch area
- Data flow processing element
- MIT tagged token dataflow architecture
- Data flow → Asynchronous
- Manchester data flow machine
- OoO vs Data flow
- OoO implements data flow under the hood
- Preserves sequential semantics
Vector Processing
- Exploiting regular parallelism
- Single Instruction Multiple Data (SIMD)
- Time Space Duality
- Array processor
- Vector processor
- VLIW - Very Long Instruction Word
Lecture 16
- SISD
- SIMD
- MISD
- does not really exist, systolic array processors - a close approximation
- MIMD
- SIMD
- Array vs Vector processor
- Vector processing
- Row major, Column major
- Stride - distance between two elements of a vector
- Main Advantages
- Amortizes cost of instruction fetch, decode
- Main Disadvantages
- Works only if parallelism is regular
- Vector vs Scalar operation
- Amdahl's law
- Sequential bottlenecks - degrade performance in a vector processor
- Vector registers
- Vector functional units
- Multiporting
- Vector Memory System
- Vector Registers
- Address generator
- Base/Stride
- Memory banks
- Bank conflicts
- Vector chaining: data forwarding from one vector functional unit to another
- What is vector is not stored in a strided fashion in memory?
- Gather/Scatter operations
- Masked operations
- These are really predicated operations.
- Density-time implementation
- Automatic code vectorization
- SIMD
- Intel MMX instructions
- GPU
- Thread warp: A set of threads that execute the same instruction
- Warp-based SIMD vs Traditional SIMD
- Latency hiding with thread warps
- Fine-grained multi threading
- SPMD
- Branch divergence problem
- VLIW
- Compiler
- Microarchitecture dependent - exposing microarchitecture to the compiler
- No dependency check between and across instructions
- Commercial VLIW machines
- Multiflow TRACE
- Transmata Crusoe
- TI C6000, Trimedia, STMicro
- Intel-64
- Instruction bundles
- VLIW NOP problem
- Intimately tied with microarchitecture
- VLIW optimizations being applied in superscalar machines
Lecture 17
Memory
- Latency
- Bandwidth
- AMD Barcelona
- Ideal memory
- Zero access time
- Zero cost
- Infinite capacity
- Ideal memory requirements - opposing
- large, fast memory vs small, slow memory
- Two key memory technologies
- DRAM
- SRAM
DRAM
- Rows
- Columns
- Row decoder
- Column decoder
- Bit lines, Word lines
- Column address supplied after row address - to save pins
- Refresh
Memory hierarchy
- Large, fast memory vs small, slow memory
- Register file also part of memory hierarchy
- Memory locality
- Temporal
- Spatial
Caching
- Cache line/block
- Ideal Load-to-use-latency = 1 cycle
- Cache hierarchy
- Manual vs automatic memory management
- Hit rate, Miss rate
- Hit/miss penalty
- Pentium 4 hit/miss latency numbers
- Important cache decisions
- Placement
- Replacement
- Separate I and D caches vs unified I/D cache
- Tag store/Data store
- Tag bits
- Index bits
- Conflict miss
- Set associativity
- Full associativity?
- Replacement in set associative caches
- Random
- FIFO
- Least recently used
- Least costly to re-fetch
- Optimal?
- Implementing perfect LRU
- Expensive
- Other alternatives
- Not MRU
- Hierarchical LRU
- Belady's optimal
- Tag store entry
- Valid bit
- Tag
- Replacement policy bits
- Dirty bit - write back vs write through?
Lecture 18
Caches (continued)
- Write back vs Write through
- Cache consistency/coherence
- Allocate/No-allocate on a write miss
- Sectored caches
- Separate Instruction, data caches vs unified caches
- First level cache design
- Design decisions much affected by cycle time
- Second level cache design
- Power, area, capacity considerations also matter
Cache performance
- Cache size
- Working set
- Block size
- Critical word
- Subblocking
- Associativity
- Cache miss classification
- Compulsory miss
- Prefetching can be used to eliminate these
- Conflict miss
- More associativity
- Other solutions: victim cache, hashing, software hints
- Capacity miss
- Non-blocking/lockup-free caches
- Miss status handling registers/Miss buffers
- Memory-level parallelism
High-bandwidth caches
- Multiple accesses to cache in the same cycle
- True multiporting
- Virtual multiporting
- used in Alpha 21264
- Multiple cache copies
- Banking/Interleaving
Memory
- Row
- Column
- Row decoder
- Column decoder
- Some basic concepts
- Address space
- Addressability
- Alignment
- Interleaving
- heavily interacts with your access pattern
- Rank - consists of multiple chips
The DRAM subsystem
- Channel
- DIMM
- Rank
- Chip
- Bank
- Row/Column
Lecture 19
Aside
- DRAM scaling issues
- PCM and other alternative technologies
DRAM (continued)
- Word line
- Bit line
- Sense amplifier → also called row-buffer
- Row, Column address pair
- Refresh
- Interferes with normal operation
- Data has to be kept intact in DRAM
- Bank
- DRAM chip
- Rank - multiple chips
- DIMM - one or more ranks
- Multiple DIMMs together to get more capacity
- Burst mode
- Precharge
- Activate - Row Address Strobe (RAS)
- Read/Write - Column Address Strobe (CAS)
- Critical word first design
- Interleaving - attempts to minimize conflicts for concurrent accesses
- Memory bandwidth problem
- Address mapping
- Row interleaving
- Cache block interleaving
- Interaction with virtual → physical mapping
- DRAM refresh
- Burst refresh
- Distributed refresh
- Different cells - need to be refreshed at different rates
- Weak and strong cells due to variability in DRAM manufacturing process
- Aside: Memory errors (bit flips) are the cause of blue screens several times
- Bloom filter
- Volatile vs Non-volatile memory
- DRAM controller
Aside: Power Reduction Techniques
- Dynamic power, Static power
- Static power - depends on leakage
- Dynamic voltage and Frequency Scaling (DVFS)
- Power conservation principles
- Reduce frequency
- Turn off
- Clock gating
- Sleep transistors
- Power impacts performance - Thermal Design Power
- Turn-off/turn-on latency
Back to DRAM
- Turn off DRAM chips when data not mapped to a chip
- Where to place DRAM controller?
- Chipset (off-chip)
- CPU chip (on-chip)
- DRAM controller
- Request buffer
- Arbiter
- Bank scheduler, Bus scheduler
- Scheduling policies
- FCFS
- FR-FCFS - maximizes DRAM throughput
- Row-buffer management
- Open row
- Closed row
- DRAM timing constraints - over 50 timing contraints
- DRAM power management
- Active
- All banks idle
- Self-refresh
- Power down
Lecture 20
Bloom filters (revisited)
- False positive
- False negative
- Aliasing/Collision
- Hash function
Memory scheduling (in multicore systems)
- FR-FCFS - maximizes DRAM throughput
- Memory performance hig
- Streaming, Random applications
- Memory request buffer
- Unfairness, low system performance
- Denial of service
- Inter-thread interference
- Resource (bank, bus, row-buffer) conflicts
- Quality of Service (QoS)
- Stall time fair memory scheduling (STFM)
- Estimate slowdown (using stall time)
- Prioritize slowed down threads
- Unfairness = Max Slowdown/Min Slowdown
- Parallelism-aware batch scheduling (PAR-BS)
- Preserving Memory Level Parallelism (MLP)
- Bank level parallelism
- Parallelism-awareness
- Could cause starvation
- Request batching
- Grouping oldest requests
- Avoids starvation
- Principle applied in disk controllers in 1960s
- Within-batch scheduling
- Shortest stall-time first ranking
- Batch sizes - marking cap
- Thread-aware memory schedulers improve performance and fairness
- Memory request scheduling - the major memory interference reduction approach we've seen
- Other ways of interference reduction (and in fact, shared resource contention)
- Data mapping (Channel partitioning)
- Source throttling
- Handling interference in parallel applications
- Barrier synchronization
- Some threads on critical path of execution due to synchronization
- Wrap up
- Inter-thread interference is also present in other shared resources
- Memory controller
- Interconnect
- Caches
Virtual Memory
- Illusion of infinite capacity
- Makes programmer's life easy
- Gives programmer the illusion of a large address space
- Limited by the virtual address space
- ISA has an address space greater than physical memory size
- Basic mechanism
- Indirection
- Virtual address space → divided into pages
- Physical address space → divided into frames
- Virtual page number
- Physical frame number
- Page table
- is per process
- Page fault
- Precise exceptions
- Exception handler
- Address translation
- Page size
- x86 has variable length page sizes ranging from 4KB to 2GB
- Page table entry
- Valid bit
- Dirty bit
- Reference or Access bit
- Used for page replacement decisions
- Page frame number
Lecture 21
Virtual Memory
- 2 major functions of virtual memory
- Address translation
- Protection
- Memory management unit
- Access control
- Types of access - Read, Write, Execute
- Privilege level - Kernel, Executive, Supervisor, User
- Privilege levels in x86
- Virtual memory not always needed for access control
- Other ways of doing access control
- Base and bound registers
- Segmented address space
- Page table
- Can be huge
- Multi-level page tables
- Multi-level page tables in x86
- Page directory (first level)
- Page table (second level)
- Page table base register (or) Page directory base register part of a process's context
- called CR3 in x86
- x86 page table and page directory examples
- x86 four-level paging
- Making address translation faster
- Translation Lookaside Buffer (TLB)- a cache of page table entries
- accessed with virtual page number
- TLB hit
- TLB miss
- On a TLB miss
- Page table walk
- Hardware managed vs software managed TLB miss
- Cache-Virtual Memory interaction
- Physical cache
- Virtual cache
- Virtual-physical cache
- Ideally
- Fast access of virtual cache
- No synonym of physical cache
- Solution: Virtual-physical cache
- Virtually indexed physically tagged cache
- synonym problem
- homonym problem
- Ways to handle the synonym problem
- Page coloring
- In hardware,
- On a write, check what other physical locations the address could be stored at
- Update/Invalidate if the cache block exists there
- So far, virtual memory - cache interaction
- Virtual memory - DRAM interaction also possible
- Ex: Page coloring to minimize bank conflicts
Lecture 22
Latency tolerance
- Instruction window
- How large does it have to be to continue decoding?
- Full-window stall
- Large number of stalls due to L2 misses
- Tolerating stalls
- Caching
- Prefetching
- Multithreading
- Out-of-order execution
- Write buffer
- Writes are off the critical path
- Caching
- Passive technique, always incurs compulsory misses
- Prefetching
- initially used in IBM360/91
- Multithreading
- Coarse grained multithreading
- Intel Itanium II
- Fine grained multithreading
- Sun Niagara
- Intel Larrabee
- Switch on event multithreading
- Simultaneous multithreading
- Helper threads
Prefetching
- Software/Hardware
- Execution based prefetching
- Can eliminate compulsory misses which caching can't
- Works if predictable pattern of misses
- Prefetching misprediction
- No need to recover
- Prefetcher placement in the hierarchy
- What to prefetch?
- Accurate prediction is important
- Wasteful prefetches waste resources
- Prediction based on past access patterns
- Use compiler knowledge
- Software prefetches - prefetch instructions
- When to prefetch?
- Timeliness
- More timely by being more aggressive
- Demand access stream
- Where to prefetch into?
- Which level of cache?
- Skew cache replacement policy to favor demand-fetched blocks
- How?
- Software
- Hardware
- Ex: A stride prefetcher requires hardware to detect strides and prefetch
- Execution-based
- Executing a piece of the program to generate prefetches
- Software prefetching
- x86 prefetch instructions
- Microarchitecture dependent
- Prefetch distance
- Depends on hardware implementation - portability issues
- Compiler inserted prefetches
- Profiling
- Hardware prefetching
- Next-line prefetcher
- Stride prefetcher
- Instruction based stride prefetching
- Cache block address based stride prefetching
- Stream buffer
- Prefetcher performance
- Accuracy
- Coverage
- Timeliness
- Bandwidth consumption
- Cache pollution
Lecture 23
Aside - Benchmarking
- SPEC CPU benchmarks
- Floating point
- Integer
- SPEC web
- SPEC jbb
- Special purpose vs general purpose processors
- Field Programmable Gate Arrays (FPGA) - Special purpose design
- Provides configurability at the logic level
- Application Specific Integrated Circuits (ASIC) - even more special purpose than FPGAs
Back to prefetching
- Co-operative hardware/software prefetchers
- Irregular access patterns
- Indirect array accesses
- Random patterns
- Linked data structures
- Prefetchers for irregular access patterns
- Correlation based prefetching
- Markov prefetching
- Content directed prefetching
- Prefetches pointers
- How do you identify a pointer?
- Aside: How about prefetching across page boundaries?
- Execution-based prefetchers
- Pre-execute pruned program to prefetch
- Where to execute?
- When to spawn?
- When to terminate?
- Pre-execution thread can be in hardware or software
- Pre-execution on a simultaneous multithreading processor
- Slice construction
Runahead Execution
- Ideally, very long instruction window
- Expensive
- Runahead - one possible way to get the benefits of a large window
- Memory-level parallelism
- L2-miss dependent instructions - Invalid
- Invalid values not used for prefetching/branch resolution
- Implemented in IBM Power6, SUN “Rock”
Lecture 24
Runahead execution (Continued)
- Skip list
- Runahead execution - low cost, bandwidth efficient
- Runahead vs Increasing size of out-of-order execution window
- Compute-bound workloads - not much improvement with runahead/prefetching
- Runahead in Sun ROCK
- Inefficiency in runahead
- Dependent cache misses
- How to parallelize?
- Predict the value.
Multiprocessing
- Parallelism - Doing multiple things at the same time
- One processor at frequency F vs four processors at frequency F/4 (assuming perfectly parallelizable)
- Four processors at frequency f/4 is lower power
- Cost-efficiency, scalability and dependability
- Multiple threads vs multiple processes
- MIMD processing
- Loosely coupled multiprocessors
- Message passing
- Remote procedure calls
- Tightly coupled multiprocessors
- Shared global memory address space
- Synchronization of accesses to shared data
- Tightly coupled multiprocessors (in more detail)
- Cache consistency
- Resource sharing/contention
- Load imbalance
- Aside: Hardware based multithreading
- Coarse-grained
- Fine-grained
- Simultaneous multithreading
- Comparing a single processor system with a multiprocessor system
- Speedup in a multiprocessor system
- Horner's method
- Superlinear speedup
- Working set effects
- Cache effects
- Utilization
- Redundancy
- Efficiency
- Caveats of parallelism
- Why diminishing returns?
- Amdahl's law
- Serial bottleneck
- Parallel portion not perfectly parallel
- Synchronization overhead
- Load imbalance
- Resource sharing
Lecture 25
- Sequential bottleneck
- Example with a task queue
- Bottlenecks in parallel portion
- Difficulty in parallel programming
- “Embarrasingly parallel applications”
- Difficult to parallelize applications
Memory Ordering
- Memory ordering in parallel programs
- Preserving an expected ordering
- In a single program
- Sequential order specified by the von Neumann model
- Precise architectural state
- Consistent across two runs
- In a data flow program
- No sequential ordering
- In a MIMD processor
- Why memory ordering matters?
- Ease of debugging
- Correctness
- Performance and overhead
- Correctness violation example
- 2 processors do not see the same ordering from their respective viewpoints
- Need for a global order - sequential consistency
- Two issues with sequential consistency
- Limits performance enhancement
- Too strong
- Relaxed memory models
- Enforce a global order only at synchronization boundaries
Cache coherence
- Shared memory model
- Whose responsibility?
- hardware
- software
- Ex: IBM cell
- A simple cache coherence scheme
- Valid/Invalid
- Snooping the bus
- Update vs Invalidate
- Non-solutions to coherence
- No hardware coherence
- Share all caches between all processors
- Update vs Invalidate tradeoffs
- Depends on write frequency and sharing behavior
- Ping-ponging
- Two coherence methods
- Snoopy bus
- Directory
- Directory based coherence
- A central directory
- Keeps track of the sharing state of each cache block across all processors
- Read exclusive
- More scalable
- Directory could be distributed
- Snoopy cache coherence
- Easier to implement with a common bus
- The MSI protocol
- Modified
- Shared
- Invalid
- MESI (or) Illinois protocol
Lecture 26
Cache coherence (continued)
- The MESI protocol
- Modified
- Exclusive
- Shared
- Invalid
- State diagram for Pentium Pro
- Snoopy invalidation tradeoffs
- Downgrade from Modified to Shared or Invalid
- Cache-to-cache transfer
- Problem with MESI
- Shared state - data has to be clean
- Unnecessary memory updates
- Solutions to this problem
- Invalidate block instead of M→S
- MOESI protocol
- one cache designated as owner for a block
- Pipeline parallelism
- More states and optimizations
- Ensuring correctness is harder
- Snoopy cache vs Directory coherence
- Directory
- Key operation: set inclusion test
- False positives are ok
- Compressed representations, bloom filters, linked lists can be used.
- Contention resolution
- Ping ponging
Interconnect
- Determines scalability of the system
- Main properties
- Topology
- Routing
- Buffering and flow control
- Topology
- Bus
- Point-to-point
- Mesh
- Buffered crossbar
- Sun UltraSPARC T2 core-to-cache crossbar
- Flow control
- Back pressure
- Multistage networks
- Delta network
- Omega network
- Circuit switching
- Packet switching
- Ring topology
- Implemented on several modern multicore processors
- IBM Cell, Larabee, Sandy bridge
- Bisection bandwidth
- Mesh topology
- Torus topology
- Trees
- Hypercubes
- Caltech cosmic cube
- 64-node message passing machine
- Handling contention
Lecture 27
Interconnection networks
- Handling contention
- Buffer
- Drop
- Misroute/Deflect
- X-Y routing/Dimension order routing
- Flow control
- Hot potato routing
- Forward progress/Live lock
Multicore design
- Multiple threads on a single core vs multiple cores
- Amdahl's law
- Serial portion vs parallel section demands
- Large vs small cores
- Example of a large core - IBM Power 4
- Example of a small core - SUN Niagara
- Tile-large approach
- Tile-small approach
- Best of both worlds - Heterogeneous multicore
- Asymmetric multicore
- CRAY-I
- had scalar and vector units
- Example of an early heterogeneous machine
- Accelerating parallel bottlenecks
- Accelerating critical sections
- Programmer vs architect tradeoff
- Contention for critical sections
- Accelerating critical sections
- Ship critical section to large core
- Heterogeneous substrates - at a more higher level
- Do not use a one size fits all approach
- CPUs, GPUs, Reconfigurable logic on the same chip
- Special purpose vs general purpose