18742: Reading List and Course Plan

(Required readings are marked with a *)

Part I: Parallel Computer Architectures

Course Intro, Architecture Review, Amdahl's Law (1/17)

*Cramming More Components onto Integrated Circuits (AKA: Moore's Law)

*Parallel Architectures (AKA: Flynn's Taxonomy)

*Validity of the single processor approach to achieving large scale computing capabilities (AKA: Amdahl's Law)

Parallel Architectures (1/19)

*Multiscalar processors

*The Case for a Single-chip Multiprocessor

Parallel Execution Strategies

Dataflow and Tiled Architectures (1/24)

*An Evaluation of the TRIPS computer system

Dataflow execution of sequential imperative programs on multicore architectures

Evaluation of the RAW Microprocessor: An Exposed Wire-delay Architecture for ILP and Streams

Throughput Computing (1/26)

*Larrabee: a many-core x86 architecture for visual computing

*Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Writing and Executing Parallel Programs

Lecture: Parallel programming overview (1/31)

*How to make a multiprocessor computer that correctly executes multiprocess programs

*Time, clocks and the ordering of events in a distributed system

*A Primer on Memory Consistency and Cache Coherence (Chapters 3-4)

Cache Coherence and Memory Consistency (2/9)

*Why On-chip Cache Coherence is here to stay

*Token Coherence: Decoupling Performance and Correctness

Memory consistency and event ordering in scalable shared-memory multiprocessors

Memory Consistency Models (2/14)

*Foundations of the C++ concurrency Memory Model

*x86-TSO: a rigorous and usable programmer’s model for x86 multiprocessors

Synchronization and Transactional Memory

Optimizing Synchronization (2/16)

*Speculative lock elision: enabling highly concurrent multithreaded execution

*Inferential queueing and speculative push for reducing critical communication latencies

Hardware Transactional Memory (2/21)

*Transactional Memory

*Making the fast case common and the uncommon case simple in unbounded transactional memory

Hardware Transactional Memory Implementations (2/23)

*Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack

*Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Software Transactional Memory (2/28)

*Software Transactional Memory

*Software Transactional Memory: Why is it only a research toy?

Synthesis Lectures on Transactional Memory (AKA: the TM Book)

Memory Consistency Enforcement Mechanisms

Data-race-free and Speculative Models (3/2)

*DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

*Transactional Memory Coherence and Consistency

BulkSC: bulk enforcement of sequential consistency

SARC Coherence: Scaling Directory Cache Coherence in Performance and Power

Threads gone wild: Dealing with Concurrent Software Bugs

Lecture: Overview of Concurrency Bugs (3/7)

*Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

Deterministic Execution (3/9)

*DMP: deterministic shared memory multiprocessing

*Grace: safe multithreaded programming for C/C++

CoreDet: a compiler and runtime system for deterministic multithreaded execution

A type and effect system for deterministic parallel java

A "flight data recorder" for enabling full-system multiprocessor deterministic replay

Spring Break - No Class (3/14)

Spring Break - No Class (3/16)

Detecting and Avoiding Concurrency Bugs (3/21)

*AVIO: detecting atomicity violations via access interleaving invariants

*A Case for an interleaving constrained shared-memory multi-processor

Cooperative, Empirical Failure Avoidance for Multithreaded Programs

Finding Concurrency Bugs with Context-aware Communication Graphs

Light64: lightweight hardware support for race detection during systematic testing of parallel programs

Flexible, Hardware Acceleration for Instruction-Grain Lifeguards

Atom-aid: detecting and surviving atomicity violations

Power and Energy

Energy Modeling, Profiling, Analysis (3/23)

*Power: A First-class Architectural Design Constraint

*Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors

Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis

Flicker: a dynamically adaptive architecture for power limited multicore systems

Thread Motion: fine-grained power management for multi-core systems

Dark Silicon: The beginning of the end (3/28)

*Amdahl's Law in the Multicore Era

*Dark Silicon and the End of Multicore Scaling

(*Skim) Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions (AKA: Dennard Scaling)

Power Challenges May End the Multicore Era

Part II: Heterogeneity, Specialization, and Acceleration

Fused and Composable Heterogeneous Cores (3/30)

*Core-fusion: accommodating software diversity in chip multiprocessors

*Composable, light-weight processors

CoreGenesis: erasing core boundaries for robust and configurable performance

Enabling Dynamic Heterogeneity Through Core-on-core Stacking

Dynamic heterogeneity and the need for multicore virtualization

Specialization

Accelerators for Everything (4/4)

*Conservation cores: reducing the energy of mature computations

*QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-specific Cores

CHARM: a composable, heterogeneous, accelerator-rich microprocessor

Instructor Travel - Guest Lecture by Mike Bond (4/6)

*Conflict Exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races

*DRFx: a simple and efficient memory model for concurrent programming languages

Hyper-optimized Application-specific Accelerators (4/11)

*Q100: The Architecture and Design of a Database Processing Unit

*EIE: efficient inference engine on compressed deep neural network

Reconfigurable Accelerators

Reconfigurable Accelerators (4/13)

*Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

*A reconfigurable fabric for accelerating large-scale datacenter services (AKA: The Bing Paper)

Instructor Travel - No Class (4/18)

Carnival - No Class (4/20)

Reconfigurable Memory Systems (4/25)

*LEAP scratchpads: automatic memory and cache management for reconfigurable logic

*CoRAM: an in-fabric memory architecture for FPGA-based computing

Part IV: Emerging and alternative computing

Intermittent Computing

Lecture: Programming intermittent computers (4/27)

*Mementos: system support for long-running computation on RFID-scale devices

*A simpler, safer programming and execution model for intermittent systems

*Ambient Energy-harvesting nonvolatile processors: from circuit to system

An Energy-interference-free Hardware-Software Debugger for Intermittent Energy-harvesting Systems

Battery-free Wireless Identification and Sensing

Ambient Backscatter: Wireless Communication Out of Thin Air

Approximate Computing (5/2)

*Neural Acceleration for General Purpose Approximate Programs

*General-purpose code acceleration with limited-precision analog computation

Approximate storage in solid-state memories

Uncertain<T>: a first-order type for uncertain data

In-class Project Presentations (5/4)