Readings

Lecture 1

Required:

  • Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture. pdf
  • Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
  • Culler & Singh, Chapter 1
  • Hamming, “You and Your Research,” Bell Communications Research Colloquium Seminar, 7 March 1986. here

Optional:

  • Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
  • Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007. pdf

Supplementary Readings on Research, Writing, Reviews:

  • Levin and Redell, “How (and how not) to write a good systems paper,” OSR 1983. pdf
  • Smith, “The Task of the Referee,” IEEE Computer 1990. pdf
  • SP Jones, “How to Write a Great Research Paper”. pdf
  • Fong, “How to Write a CS Research Paper: A Bibliography”. pdf

Lecture 2

Required:

  • Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
  • Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
  • Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
  • Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf

Optional:

  • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966. pdf
  • Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
  • Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
  • Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. pdf
  • Eyerman and Eeckhout, “Modeling critical sections in Amdahl's law and its implications for multicore design,” ISCA 2010. pdf
  • Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf

Lecture 3

Required:

  • Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993. pdf
  • Seitz, “The Cosmic Cube,” CACM 1985. pdf

Optional:

  • Li and Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM TOCS 1989. pdf
  • Batcher, “Architecture of a massively parallel processor,” ISCA 1980. pdf
  • Tucker and Robertson, “Architecture and Applications of the Connection Machine,” IEEE Computer 1988. pdf

Lecture 4

Optional:

  • Moore, “Cramming more components onto integrated circuits,” Electronics, 1965. pdf
  • Stark, “On pipelining dynamic instruction scheduling logic,” MICRO 2000. pdf
  • Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996. pdf
  • Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
  • Palacharla et al., “Complexity-effective superscalar processors,” ISCA 1997. pdf

Lecture 5

Optional:

  • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
  • Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000. pdf
  • Barroso et al., “Memory system characterization of commercial workloads,” ISCA 1998. pdf
  • Ranganathan et al., “Performance of database workloads on shared-memory systems with out-of-order processors,” ASPLOS 1998. pdf
  • Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005. pdf
  • Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005. pdf
  • Chaudhry et al., “Rock: A High-Performance Sparc CMT Processor,” IEEE Micro, 2009. pdf
  • Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor,” ISCA 2009. pdf
  • Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003. pdf
  • Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Jan/Feb 2006. pdf
  • Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002. pdf
  • Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
  • Le et al., “IBM POWER6 Microarchitecture,” IBM J R&D, 2007. pdf
  • Kalla et al., “Power7: IBM’s Next-Generation Server Processor,” IEEE Micro 2010. pdf
  • Grochowski et al., “Best of both Latency and Throughput,” ICCD 2004. pdf
  • Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
  • Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf

Lecture 6

Recommended:

  • Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf
  • Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf

Optional:

  • Kumar et al., “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction,” MICRO 2003. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
  • Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
  • Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
  • Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
  • Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
  • Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
  • Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
  • Nychis et al., “Next generation on-chip networks: what kind of congestion control do we need?,” HotNets 2010. pdf
  • Das et al., “Application-aware prioritization mechanisms for on-chip networks,” MICRO 2009. pdf
  • Das et al., “Aérgia: exploiting packet latency slack in on-chip networks,” ISCA 2010. pdf
  • Das et al., “Aérgia: A Network-on-Chip Exploiting Packet Latency Slack,” IEEE Micro 2011. pdf
  • Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
  • Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf
  • Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
  • Morad et al., “Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors,” IEEE CAL 2006. pdf
  • Suleman et al., “ACMP: Balancing Hardware Efficiency and Programmer Efficiency,” HPS Technical Report 2007. pdf
  • Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
  • Suleman, “An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs,” PhD thesis 2010. pdf

Lecture 7

Optional:

  • Lefurgy et al., “Energy Management for Commercial Servers,” IEEE Computer 2003. pdf
  • Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009. pdf
  • Lee et al., “Phase-Change Technology and the Future of Main Memory,” IEEE Micro 2010. pdf
  • Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. pdf
  • Dhiman et al., “PDRAM: a hybrid PRAM and DRAM main memory system,” DAC 2009. pdf
  • Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
  • Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf

Lecture 8

Optional:

  • Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
  • Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
  • Chakraborty et al., “Computation Spreading: Employing Hardware Migration to Specialize CMP Cores on-the-fly,” ASPLOS 2006. pdf
  • Rangan et al., “Thread Motion: Fine-Grained Power Management for Multi-Core Systems,” ISCA 2009. pdf

Lecture 9

Required:

  • Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
  • Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
  • Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
  • Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf

Recommended:

  • Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
  • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
  • Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
  • Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf

Optional:

  • Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
  • Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
  • Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf
  • Ebrahimi et al., “Parallel Application Memory Scheduling,” MICRO 2011. pdf
  • Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
  • Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf
  • Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
  • Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964. pdf
  • McNairy and Bhatia, “Montecito: A Dual-Core, Dual-Thread Itanium Processor,” IEEE Micro 2005. pdf

Lecture 10

Required:

  • Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
  • Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
  • Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
  • Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf

Recommended:

  • Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
  • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
  • Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
  • Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf

Optional:

  • Yamamoto et al., “Performance Estimation of Multistreamed, Superscalar Processors,” HICSS 1994. pdf
  • Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995. pdf
  • Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
  • Jacobsen et al., “Assigning confidence to conditional branch predictions,” MICRO 1996. pdf
  • Brown and Tullsen, “Handling Long-latency Loads in a Simultaneous Multithreading Processor,” MICRO 2001. pdf
  • El-Moursy and Albonesi, “Front-End Policies for Improved Issue Efficiency in SMT Processors,” HPCA 2003. pdf
  • Raasch and Reinhardt, “The Impact of Resource Partitioning on SMT Processors,” PACT 2003. pdf
  • Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf
  • Ramirez et al., “Runahead Threads to Improve SMT Performance,” HPCA 2008. pdf
  • Van Craeynest et al., “MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor,” HiPEAC 2009. pdf
  • Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
  • Lebeck et al., “A Large, Fast Instruction Window for Tolerating Cache Misses,” ISCA 2002. pdf
  • Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal 2002. pdf

Lecture 11

Optional:

  • Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. pdf
  • Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. pdf
  • Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
  • Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. pdf
  • Dusser et al., “Zero-Content Augmented Caches,” ICS 2009. pdf
  • Islam and Stenstrom, “Zero-Value Caches: Cancelling Loads that Return Zero,” PACT 2009. pdf
  • Yang et al., “Frequent Value Compression in Data Caches,” MICRO 2000. pdf
  • Alameldeen and Wood, “Adaptive Cache Compression for High-Performance Processors,” ISCA 2004. pdf
  • Thoziyoor et al., “A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies,” ISCA 2008. pdf
  • Ekman and Stenstrom, “A Robust Main-Memory Compression Scheme,” ISCA 2005. pdf
  • Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012. pdf
  • Ubal et al., “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” PACT 2012. pdf
  • Chen et al., “C-Pack: A High-Performance Microprocessor Cache Compression Algorithm,” VLSI 2010. pdf
  • Magnusson et al., “Simics: A full system simulation platform,” Computer 2002. pdf
  • Tremaine et al., “Pinnacle: IBM MXT in a memory controller chip,” IEEE Micro 2001. pdf

Lecture 12

Optional:

  • Johnson and Hwu, “Run-Time Adaptive Cache Hierarchy Management via Reference Analysis,” ISCA 1997. pdf
  • Piquet et al., “Exploiting single-usage for effective memory management,” ACSAC 2007. pdf
  • Wu et al., “SHiP: Signature-based hit predictor for high performance caching,” MICRO 2011. pdf
  • Qureshi et al., “Adaptive insertion policies for high performance caching,” ISCA 2007. pdf
  • Jaleel et al., “Adaptive insertion policies for managing shared caches,” PACT 2008. pdf
  • Jaleel et al., “High performance cache replacement using re-reference interval prediction,” ISCA 2010. pdf
  • Xie and Loh, “PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches,” ISCA 2009. pdf
  • Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. pdf
  • Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. pdf

Lecture 13

Optional:

  • Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000. pdf
  • Rotenberg, “AR-SMT: a microarchitectural approach to fault tolerance in microprocessors,” Fault-Tolerant Computing 1999. pdf
  • Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002. pdf
  • Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
  • Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999. pdf
  • Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005. pdf
  • Zilles et al., “The use of multithreading for exception handling,” MICRO 1999. pdf
  • Dubois and Song, “Assisted Execution,” USC Tech Report 1998. pdf
  • Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999. pdf
  • Chappell et al., “Difficult-path branch prediction using subordinate microthreads,” ISCA 2002. pdf
  • Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001. pdf

Lecture 15

Required:

  • Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
  • Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf

Recommended:

  • Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
  • Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
  • Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf

Optional:

  • Luk, “Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors,” ISCA 2001. pdf
  • Sundaramoorthy et al., “Slipstream Processors: Improving both Performance and Fault Tolerance,” ASPLOS 2000. pdf
  • Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” PACT 2005. pdf
  • Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
  • Gopal et al., “Speculative Versioning Cache,” HPCA 1998. pdf
  • Franklin and Sohi, “The expandable split window paradigm for exploiting fine-grain parallelism,” ISCA 1992. pdf

Lecture 16

Required:

  • Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
  • Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf

Recommended:

  • Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
  • Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
  • Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf

Optional:

  • Franklin and Sohi, “ARB: A hardware mechanism for dynamic reordering of memory references,” IEEE TC 1996. pdf
  • Vijaykumar and Sohi, “Task selection for a multiscalar processor,” MICRO 1998. pdf
  • Moshovos et al., “Dynamic Speculation and Synchronization of Data Dependences,” ISCA 1997. pdf
  • Chrysos and Emer, “Memory Dependence Prediction using Store Sets,” ISCA 1998. pdf
  • Martinez and Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002. pdf
  • Rajwar and Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
  • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
  • Shavit and Touitou, “Software transactional memory,” PODC 1995. pdf

Lecture 17

Required:

  • Dally, “Virtual Channel Flow Control,” ISCA 1990. pdf
  • Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. pdf
  • Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. pdf
  • Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. pdf
  • Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. pdf

Recommended:

  • Fallin et al., “CHIPPER: A Low-Complexity, Bufferless Deflection Router,” HPCA 2011. pdf
  • Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012. pdf
  • Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. pdf

Optional:

  • Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993. pdf
  • Das et al., “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs,” HPCA 2009. pdf
  • Seitz, “The Cosmic Cube,” CACM 1985. pdf
  • Gottlieb et al., “The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine,” ISCA 1982. pdf

Lecture 18

Required:

  • Dally, “Virtual Channel Flow Control,” ISCA 1990. pdf
  • Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. pdf
  • Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. pdf
  • Fallin et al., “CHIPPER: A Low-Complexity, Bufferless Deflection Router,” HPCA 2011. pdf
  • Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012. pdf
  • Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. pdf

Recommended:

  • Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. pdf
  • Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. pdf
  • Chang et al., “HAT: Heterogeneous Adaptive Throttling for On-Chip Networks,” SBAC-PAD 2012. pdf

Optional:

  • Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. pdf
  • Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro 1997. pdf

Lecture 20

Optional:

  • Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
  • Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
  • Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
  • Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
  • Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf
  • Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
  • Martinez and Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002. pdf
  • Rajwar and Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002. pdf
  • Shavit and Touitou, “Software transactional memory,” PODC 1995. pdf
  • Dice et al., “Early experience with a commercial hardware transactional memory implementation,” ASPLOS 2009. pdf
  • Wang et al., “Evaluation of Blue Gene/Q hardware support for transactional memories,” PACT 2012. pdf
  • Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. pdf

Lecture 21

Optional:

  • Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
  • Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
  • Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
  • Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
  • Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. pdf
  • Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. pdf
  • Das et al., “Application-aware prioritization mechanisms for on-chip networks,” MICRO 2009. pdf
  • Das et al., “Aérgia: exploiting packet latency slack in on-chip networks,” ISCA 2010. pdf
  • Grot et al., “Express Cube Topologies for On-Chip Interconnects,” HPCA 2009. pdf
  • Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011. pdf
  • Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. pdf

Lecture 22

Optional:

  • Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
  • Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
  • Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
  • Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
  • Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. pdf
  • Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. pdf
  • Dennis and Misunas, “A preliminary architecture for a basic data flow processor,” ISCA 1974. pdf
  • Treleaven et al., “Data-Driven and Demand-Driven Computer Architecture,” ACM Computing Surveys 1982. pdf
  • Veen, “Dataflow Machine Architecture,” ACM Computing Surveys 1986. pdf
  • Arvind and Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” IEEE TC 1990. pdf
  • Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986. pdf

Lecture 23

Optional:

  • Sakai et al., “An Architecture of a Dataflow Single Chip Processor,” ISCA 1989. pdf
  • Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
  • Colwell, “The Pentium Chronicles,” Wiley-IEEE Computer Society Press 2005.
  • Kung, “Why Systolic Architectures?,” IEEE Computer 1982. pdf
  • Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986. pdf
  • Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987. pdf

Lecture 24

Required:

  • Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
  • Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
  • Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
  • Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
  • Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
  • Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf
  • Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. pdf
  • Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
  • Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. pdf
  • Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches,” ISCA 2009. pdf

Recommended:

  • Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
  • Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. pdf
  • Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. pdf
  • Kim et al., “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches,” ASPLOS 2002. pdf
  • Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007. pdf
  • Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. pdf

Optional:

  • Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. pdf
  • Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. pdf

Lecture 25

Required:

  • Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
  • Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
  • Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
  • Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
  • Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
  • Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf

Recommended:

  • Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
  • Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. pdf
  • Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. pdf

Optional:

  • Moscibroda and Mutlu, “Distributed order scheduling and its application to multi-core DRAM controllers,” PODC 2008. pdf
  • Waldspurger and Weihl, “Lottery scheduling: flexible proportional-share resource management,” OSDI 1994. pdf

Lecture 26

Required:

  • Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
  • Ebrahimi et al., “Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems,” ASPLOS 2010. pdf
  • Subramanian et al., “MISE: Providing Performance Predictability in Shared Main Memory Systems,” HPCA 2013.

Recommended:

  • Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” MICRO 2010. pdf
  • Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
  • Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
  • Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
  • Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” ISCA 2008. pdf
  • Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
  • Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf

Lecture 27

Required:

  • Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
  • Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” HPCA 2009. pdf

Recommended:

  • Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
  • Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
  • Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
  • Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
  • Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007. pdf
  • Zhuang and Lee, “A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches,” ICPP 2003. pdf
  • Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf