Buzzwords

Lecture 1

  • Architecture of Parallel Computers
    • Fundamentals and Tradeoffs
  • Static and Dynamic Scheduling
  • Parallel Task Assignment
    • Static/Dynamic
    • Task Queues
    • Task Stealing
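
A minimal sketch of the task-queue and task-stealing items above, assuming per-worker deques and a fixed batch of tasks (all names here are illustrative, not from the lecture):

```python
import collections
import random
import threading

NUM_WORKERS = 4
queues = [collections.deque() for _ in range(NUM_WORKERS)]

# Dynamic assignment: tasks are dealt round-robin up front, and idle
# workers steal, smoothing out the load imbalance a purely static
# partition would suffer from.
for i in range(100):
    queues[i % NUM_WORKERS].append(i)

def worker(wid):
    while True:
        try:
            task = queues[wid].popleft()   # own queue: take from the front
        except IndexError:
            victims = [v for v in range(NUM_WORKERS) if v != wid and queues[v]]
            if not victims:
                return                     # nothing left anywhere
            try:
                task = queues[random.choice(victims)].pop()  # steal from the back
            except IndexError:
                continue                   # lost the race with the victim; retry
        # ... execute `task` here ...

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Owners pop from one end of their deque and thieves steal from the other, which keeps them mostly out of each other's way; that asymmetry is the core of classic work-stealing schedulers.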

Lecture 2

  • Parallel Computer
  • SISD, SIMD, MISD, MIMD
  • Performance
  • Power consumption
  • Cost efficiency
  • Scalability
  • Complexity
  • Dependability
  • Instruction Level Parallelism
  • Data Parallelism
  • Task Level Parallelism
    • Parallel programming
    • Thread level speculation
  • Loosely/Tightly coupled multiprocessors
  • Shared memory synchronization
  • Cache coherence
  • Ordering of memory operations
  • Hardware-based Multithreading
    • Coarse grained
    • Fine grained
    • Simultaneous
  • Amdahl’s Law (formula after this list)
    • Serial bottleneck
    • Synchronization overhead
    • Load imbalance overhead
    • Resource sharing overhead
  • Superlinear Speedup
    • Unfair comparisons
    • Memory/cache effect
  • Utilization, Redundancy, Efficiency
  • Parallel Programming
  • Parallel and Serial Bottlenecks
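
For reference, Amdahl's Law quantifies the serial-bottleneck sub-bullet above: with serial fraction s and N processors,

```latex
% Amdahl's Law: speedup is capped by the serial fraction s
\[
  \text{Speedup}(N) = \frac{1}{s + \frac{1-s}{N}},
  \qquad
  \lim_{N\to\infty}\text{Speedup}(N) = \frac{1}{s}
\]
```

Even s = 0.05 caps speedup at 20x regardless of core count; synchronization, load imbalance, and resource sharing overheads act as additional serialization on top of that.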

Lecture 3

  • Programming Models vs. Architectures
  • Shared memory programming model
  • Message passing programming model
  • Shared memory hardware
  • Message passing hardware
  • Communication abstraction
  • Generic Parallel Machine
  • Data Flow Graph
  • Synchronization
  • Application Binary Interface (ABI)
  • Data parallel programming model
  • Data parallel hardware
  • Connection Machine
  • Data flow programming model
  • Data flow hardware
  • Scalability
  • Interconnection Schemes
  • Uniform Memory/Cache Access (UMA/UCA)
  • Memory latency
  • Memory bandwidth
  • Symmetric multiprocessing (SMP)
  • Data placement
  • Non-Uniform Memory/Cache Access (NUMA/NUCA)
  • Local and remote memories
  • Critical path of memory access
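
The data-placement and local/remote items reduce to a simple expected-latency identity; writing f for the fraction of accesses that placement keeps local:

```latex
% Average memory latency under NUMA
\[
  L_{\text{avg}} = f \cdot L_{\text{local}} + (1-f)\cdot L_{\text{remote}}
\]
```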

Lecture 4

  • Multi-Core Processors
  • Technology scaling
  • Transistors and die area
  • Large Superscalar
  • Single-thread performance
  • Instruction issue queue
  • Multi-ported register file
  • Loop-level parallelism
  • Multiprogramming
  • Bigger caches
  • Multithreading
  • Thread-level parallelism
  • Resource sharing
  • Integrating platform components
  • Clustered superscalar processor
  • Inter-cluster bypass
  • Traditional symmetric multiprocessors

Lecture 5

  • Chip Multiprocessor (CMP)
  • Workload Characteristics
  • Instruction Level Parallelism (ILP)
  • Piranha CMP
  • Processing Node
  • Coherence Protocol Engine
  • I/O Node
  • Sun Niagara (UltraSPARC T1)
  • Niagara Core
  • Sun Niagara II (UltraSPARC T2)
  • Chip Multithreading (CMT)
  • Sun Rock
  • Runahead Execution
  • Memory Level Parallelism (MLP)
  • IBM POWER4
  • IBM POWER5
  • IBM POWER6
  • IBM POWER7
  • Large vs. Small Cores
  • Tile-Large vs. Tile-Small
  • Asymmetric Chip Multiprocessor (ACMP)
  • Serial Bottlenecks
  • Amdahl's Law
  • Asymmetric vs. Symmetric Cores
  • Frequency Boosting
  • EPI Throttling
  • Dynamic voltage and frequency scaling (DVFS)
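
The EPI-throttling and DVFS items lean on the first-order dynamic-power relation for CMOS logic (activity factor alpha, switched capacitance C, supply voltage V, clock frequency f):

```latex
% First-order dynamic power of CMOS logic
\[
  P_{\text{dynamic}} \approx \alpha\, C\, V^2 f
\]
```

Since attainable frequency falls roughly linearly with supply voltage, scaling both down yields near-cubic power savings; EPI throttling uses that headroom to spend more energy per instruction on serial phases and less on parallel ones.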

Lecture 6

  • EPI Throttling
  • Asymmetric Chip Multiprocessor (ACMP)
  • Energy Efficiency
  • Programmer effort
  • Shared Resource Management
  • Serialized Code Sections
  • Accelerated Critical Sections (ACS)
  • Bottleneck Identification and Scheduling (BIS)

Lecture 7

  • Main Memory
  • Memory Capacity
  • Memory Latency
  • Memory Bandwidth
  • Memory Energy/Power
  • Technology Scaling
  • DRAM Scaling
  • Charge Memory
  • Resistive Memory
  • Non-volatile Memory
  • Phase Change Memory (PCM)
  • Hybrid Memory
  • Write Filtering
  • Row-Locality Aware Data Placement
  • Tags in Memory
  • Dynamic Data Transfer Granularity
  • Memory Security

Lecture 8

  • Barriers
  • Thread Waiting
  • Bottleneck Acceleration
  • False Serialization
  • Starvation
  • Preemptive Acceleration
  • Staged Execution Model
  • Segment Spawning
  • Inter-segment data
  • Generator instruction
  • Data Marshaling
  • Pipeline Parallelism
  • Coverage, Accuracy, Timeliness
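
The last item names the standard figures of merit for prefetch-style mechanisms such as data marshaling; the first two are commonly defined as

```latex
\[
  \text{Accuracy} = \frac{\text{useful transfers}}{\text{total transfers}},
  \qquad
  \text{Coverage} = \frac{\text{misses eliminated}}{\text{misses without the mechanism}}
\]
```

Timeliness asks whether a transfer arrives before its consumer needs it, rather than too early (evicted first) or too late (stall anyway).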

Lecture 9

  • Memory Scheduling
  • Fairness vs. Throughput
  • Thread cluster
  • Memory intensity
  • CPU-GPU Systems
  • Heterogeneous Memory Systems
  • Thread
  • Multitasking
  • Thread context
  • Hardware Multithreading
  • Latency tolerance
  • Fine-grained Multithreading
  • Pipeline utilization
  • Coarse-grained Multithreading
  • Stall events
  • Thread Switching Urgency
  • Fairness

Lecture 10

  • Fine-grained Multithreading
  • Coarse-grained Multithreading
  • Fairness and throughput
  • Thread Switching Urgency
  • Simultaneous Multithreading
  • Functional Unit Utilization
  • Superscalar Out-of-Order Pipeline
  • SMT Pipeline
  • SMT Scalability
  • SMT Fetch Policy
  • Long Latency Loads
  • Memory-Level Parallelism (MLP)
  • Runahead Threads
  • Thread Priority Support
  • Thread Throttling

Lecture 11

  • Utility cache partitioning
  • Cache capacity
  • Cache data compression
  • Frequent value compression
  • Frequent pattern compression
  • Low dynamic range
  • Base+Delta encoding (sketch after this list)
  • Main memory compression
  • IBM MXT
  • Linearly compressed pages
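
A toy version of the base+delta idea above: cache lines often have low dynamic range, so a line can be stored as one full-width base plus narrow deltas. Field widths and names here are illustrative; the real design tries several base/delta combinations in hardware.

```python
def base_delta_compress(line, delta_bits=8):
    """Encode a line of 32-bit words as (base, narrow deltas) if every
    word lies within a signed delta_bits range of the first word; return
    None (keep the line uncompressed) otherwise. Illustrative only: the
    real design tries several base/delta-size combinations in parallel."""
    base = line[0]
    lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
    deltas = [v - base for v in line]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas      # e.g. 8 words: 32 + 8*8 = 96 bits, not 256
    return None                  # low-dynamic-range assumption failed

# Pointers into one region compress well; random words would not.
line = [0x0A0B0C00, 0x0A0B0C04, 0x0A0B0C08, 0x0A0B0C10]
print(base_delta_compress(line))   # (168496128, [0, 4, 8, 16])
```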

Lecture 13

  • Fault and Error
  • Fault Detection
  • Fault Tolerance
  • Transient Fault
  • Permanent Fault
  • Space redundancy
  • Time redundancy
  • Lockstepping
  • Simultaneous Redundant Threading (SRT)
  • Sphere of Replication
  • Input Replication
  • Output Comparison
  • Branch Outcome Queue
  • Line Prediction Queue
  • Chip Level Redundant Threading
  • Exception Handling
  • Helper Threading for Prefetching
  • Thread-Based Pre-Execution

Lecture 15

  • Slipstreaming
  • Instruction Removal
  • Dual Core Execution
  • Thread Level Speculation
  • Conflict Detection
  • Speculative Parallelization
  • Inter-Thread Communication
  • Data Dependences and Versioning
  • Speculative Memory State
  • Multiscalar Processor

Lecture 16

  • Multiscalar Processor
  • Multiscalar Tasks
  • Register Forwarding
  • Task Sequencing
  • Inter-Task Dependences
  • Address Resolution Buffer
  • Memory Dependence Prediction
  • Store-Load Dependencies
  • Memory Disambiguation
  • Speculative Lock Elision
  • Atomicity
  • Speculative Parallelization
  • Accelerating Critical Sections
  • Transactional Lock Removal

Lecture 17

  • Interconnection Network
  • Network Topology (metrics sketched after this list)
  • Bus
  • Crossbar
  • Ring
  • Mesh
  • Torus
  • Tree
  • Hypercube
  • Multistage Logarithmic Network
  • Circuit vs. Packet Switching
  • Flow Control
  • Head of Line Blocking
  • Virtual Channel Flow Control
  • Communicating Buffer Availability
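
To make the topology items concrete, a small helper computing two textbook static metrics, diameter (worst-case hop count) and per-node degree, under the usual assumptions (square mesh/torus, power-of-two node count):

```python
import math

def topology_metrics(n):
    """Textbook diameter and per-node degree for a few n-node topologies;
    assumes n is a power of two with an integer square root so the
    mesh/torus are square."""
    side = math.isqrt(n)
    dim = int(math.log2(n))
    return {
        "ring":      {"diameter": n // 2,          "degree": 2},
        "2D mesh":   {"diameter": 2 * (side - 1),  "degree": 4},   # interior nodes
        "2D torus":  {"diameter": 2 * (side // 2), "degree": 4},
        "hypercube": {"diameter": dim,             "degree": dim},
        "crossbar":  {"diameter": 1,               "degree": 1},   # one port into the switch
    }

for name, m in topology_metrics(64).items():
    print(f"{name:9} diameter={m['diameter']:2} degree={m['degree']}")
```

These static metrics trade off against link count and bisection bandwidth, which is why the lecture pairs topology with switching and flow control.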

Lecture 18

  • Routing (dimension-order example after this list)
  • Deadlock
  • Router Design
  • Router Pipeline Optimizations
  • Interconnection Network Performance
  • Packet Scheduling
  • Bufferless Deflection Routing
  • Livelock
  • Packet Reassembly
  • Golden Packet
  • Minimally-Buffered Deflection Routing
  • Side Buffer
  • Heterogeneous Adaptive Throttling
  • Application-Aware Source Throttling
  • Dynamic Throttling Rate Adjustment
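
A sketch of the simplest deterministic routing scheme for a 2-D mesh, dimension-order (X-Y) routing; coordinates and the returned hop list are illustrative:

```python
def xy_route(src, dst):
    """Dimension-order (X-Y) routing on a 2-D mesh: deterministic, since
    the path depends only on (src, dst). Routing fully in X before Y
    forbids Y-to-X turns, breaking the channel-dependence cycles that
    cause routing deadlock."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                 # correct the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                 # then the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 3)))
# [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Bufferless deflection routing (above) gives up this determinism, so it must fight livelock instead, e.g. with Golden Packet prioritization.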

Lecture 20

  • Locks vs. Transactions
  • Transactional Memory (sketch after this list)
    • Logging/buffering
    • Conflict detection
    • Abort/rollback
    • Commit
  • Routing
    • Deterministic
    • Oblivious
    • Adaptive
  • Deadlock
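
A toy software sketch mirroring the four transactional-memory sub-bullets (buffering, conflict detection, abort/rollback, commit). All names are illustrative; a real HTM tracks read/write sets in the cache, and a real STM is far more careful about concurrency.

```python
import threading

_commit_lock = threading.Lock()
_versions = {}   # name -> version of last committed write
_values = {}     # name -> committed value

class Abort(Exception):
    """Raised when conflict detection fails; the write buffer is dropped."""

class Transaction:
    def __init__(self):
        self.read_set = {}    # name -> version observed at first read
        self.write_buf = {}   # name -> buffered value (logging/buffering)

    def read(self, name):
        if name in self.write_buf:           # read-your-own-write
            return self.write_buf[name]
        self.read_set[name] = _versions.get(name, 0)
        return _values.get(name, 0)

    def write(self, name, value):
        self.write_buf[name] = value         # invisible until commit

    def commit(self):
        with _commit_lock:
            # conflict detection: abort if anything we read has changed
            for name, seen in self.read_set.items():
                if _versions.get(name, 0) != seen:
                    raise Abort(name)
            for name, value in self.write_buf.items():   # atomic commit
                _values[name] = value
                _versions[name] = _versions.get(name, 0) + 1

def atomic(body):
    """Run body(tx) transactionally, retrying on abort/rollback."""
    while True:
        tx = Transaction()
        try:
            body(tx)
            tx.commit()
            return
        except Abort:
            continue

# The transfer commits atomically even if another thread races it.
atomic(lambda tx: (tx.write("a", tx.read("a") - 10),
                   tx.write("b", tx.read("b") + 10)))
print(_values)   # {'a': -10, 'b': 10}
```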

Lecture 21

  • Packet Scheduling
  • Stall Time Criticality
  • Memory Level Parallelism
  • Shortest Job First Principle
  • Application Aware
  • Packet Ranking and Batching
  • Slack of Packets
  • Packet Prioritizing using Slack
  • Starvation Avoidance
  • 2-D Mesh, Concentration, Replication
  • Flattened Butterfly
  • Multidrop Express Channels (MECS)
  • Kilo-NoC
  • Network-on-Chip (NoC) Quality of Service (QoS)
  • Topology-Aware QoS

Lecture 22

  • Data Flow
  • Data Flow Nodes
  • Data Flow Graphs
  • Control Flow vs. Data Flow
  • Static Data Flow (firing-rule sketch after this list)
  • Reentrant code (Function calls, Loops)
  • Dynamic Data Flow
  • Frame Pointer
  • Tagging
  • Data Structures
  • I-Structure
  • MIT Tagged Token Data Flow Architecture
  • Manchester Data Flow Machine
  • Combining Data Flow and Control Flow
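
A minimal sketch of the dataflow firing rule these items revolve around: a node fires when all of its inputs carry tokens, so execution order is set by data availability rather than a program counter. The graph encoding is illustrative, and tokens are kept rather than consumed here, standing in for the explicit copy nodes a real static-dataflow graph would use.

```python
import operator

# Dataflow graph for (a + b) * (a - b): node -> (operation, input nodes)
graph = {
    "add": (operator.add, ["a", "b"]),
    "sub": (operator.sub, ["a", "b"]),
    "mul": (operator.mul, ["add", "sub"]),
}
tokens = {"a": 7, "b": 3}    # tokens on the two input arcs

# Firing rule: a node may fire once all its inputs carry tokens.
fired = True
while fired:
    fired = False
    for node, (op, ins) in graph.items():
        if node not in tokens and all(i in tokens for i in ins):
            tokens[node] = op(*(tokens[i] for i in ins))
            fired = True

print(tokens["mul"])   # (7 + 3) * (7 - 3) = 40
```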

Lecture 23

  • Combining Data Flow and Control Flow
  • Macro Dataflow
  • Restricted Data Flow
  • Systolic Architecture
  • Systolic Computation
  • Pipeline Parallelism

Lecture 24

  • Resource Sharing
  • Shared Resource Management and QoS
  • Resource Sharing vs. Partitioning
  • Multi-core Caching
  • Shared Cache Management
  • Sharing in Main Memory
  • Memory Controller
  • Inter-Thread Interference
  • QoS-Aware Memory Scheduling
  • Stall-Time Fairness
  • Bank Parallelism-Awareness
  • Request Batching
  • Shortest Stall-Time First Ranking
  • Memory Episode Lengths
  • Least Attained Service

Lecture 25

  • QoS-Aware Memory Request Scheduling
  • Smart/Dumb Resources
  • Throughput vs. Fairness
  • Thread Cluster Memory Scheduling
  • Clustering Threads
  • CPU-GPU Systems
  • Staged Memory Scheduling
  • Parallel Application Memory QoS

Lecture 26

  • QoS-Aware Memory Systems
  • Smart vs. Dumb Resources
  • Memory Channel Partitioning
  • Application-Awareness
  • Multiple Channels
  • Memory Intensity
  • Row Buffer Locality
  • Preferred Channel
  • Integrated Memory Partitioning and Scheduling
  • Fairness via Source Throttling
  • Dynamic Request Throttling
  • Estimating System Unfairness
  • Inter-Core Interference
  • Row Buffer Interference
  • Memory Interference-induced Slowdown Estimation
  • Shared Memory Performance Predictability
  • Shared Resource Interference
  • Memory Phase Fraction
  • Alone Request Service Rate
  • Shared Request Service Rate
  • “Soft” Slowdown Guarantees
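
The last few items fit together in a MISE-style slowdown estimate: a thread spending fraction alpha of its execution in memory phases slows down in proportion to how much its request service rate (RSR) drops when the memory system is shared,

```latex
% MISE-style estimate: alpha = memory phase fraction,
% RSR = request service rate, alone vs. shared
\[
  \text{Slowdown} \approx (1-\alpha) + \alpha \cdot
  \frac{RSR_{\text{alone}}}{RSR_{\text{shared}}}
\]
```

where the alone rate is estimated by periodically giving that thread's requests highest priority; soft slowdown guarantees then follow by throttling other threads whenever the estimate exceeds a bound.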

Lecture 27

  • CPU-GPU Memory Scheduling
  • Batch Formation
  • Batch Scheduler
  • DRAM Command Scheduler
  • Prefetcher Accuracy
  • Feedback-Directed Prefetching
  • Hierarchical Prefetcher Aggressiveness Control
  • Inter-Core Cache Pollution
  • Global Control