Readings

Lecture 1

Required:

Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture. pdf
Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
Culler & Singh, Chapter 1
Hamming, “You and Your Research,” Bell Communications Research Colloquium Seminar, 7 March 1986. here

Optional:

Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007. pdf

Supplementary Readings on Research, Writing, Reviews:

Levin and Redell, “How (and how not) to write a good systems paper,” OSR 1983. pdf
Smith, “The Task of the Referee,” IEEE Computer 1990. pdf
Peyton Jones, “How to Write a Great Research Paper”. pdf
Fong, “How to Write a CS Research Paper: A Bibliography”. pdf

Lecture 2

Required:

Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf

Optional:

Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966. pdf
Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. pdf
Eyerman and Eeckhout, “Modeling critical sections in Amdahl's law and its implications for multicore design,” ISCA 2010. pdf
Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf

Lecture 3

Required:

Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993. pdf
Seitz, “The Cosmic Cube,” CACM 1985. pdf

Optional:

Li and Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM TOCS 1989. pdf
Batcher, “Architecture of a massively parallel processor,” ISCA 1980. pdf
Tucker and Robertson, “Architecture and Applications of the Connection Machine,” IEEE Computer 1988. pdf

Lecture 4

Optional:

Moore, “Cramming more components onto integrated circuits,” Electronics, 1965. pdf
Stark, “On pipelining dynamic instruction scheduling logic,” MICRO 2000. pdf
Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996. pdf
Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
Palacharla et al., “Complexity-effective superscalar processors,” ISCA 1997. pdf

Lecture 5

Optional:

Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000. pdf
Barroso et al., “Memory system characterization of commercial workloads,” ISCA 1998. pdf
Ranganathan et al., “Performance of database workloads on shared-memory systems with out-of-order processors,” ASPLOS 1998. pdf
Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005. pdf
Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005. pdf
Chaudhry et al., “Rock: A High-Performance Sparc CMT Processor,” IEEE Micro, 2009. pdf
Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor,” ISCA 2009. pdf
Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003. pdf
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Jan/Feb 2006. pdf
Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002. pdf
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
Le et al., “IBM POWER6 Microarchitecture,” IBM J R&D, 2007. pdf
Kalla et al., “Power7: IBM’s Next-Generation Server Processor,” IEEE Micro 2010. pdf
Grochowski et al., “Best of both Latency and Throughput,” ICCD 2004. pdf
Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf

Lecture 6

Recommended:

Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf
Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf

Optional:

Kumar et al., “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction,” MICRO 2003. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
Nychis et al., “Next generation on-chip networks: what kind of congestion control do we need?,” HotNets 2010. pdf
Das et al., “Application-aware prioritization mechanisms for on-chip networks,” MICRO 2009. pdf
Das et al., “Aérgia: exploiting packet latency slack in on-chip networks,” ISCA 2010. pdf
Das et al., “Aérgia: A Network-on-Chip Exploiting Packet Latency Slack,” IEEE Micro 2011. pdf
Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf
Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
Morad et al., “Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors,” IEEE CAL 2006. pdf
Suleman et al., “ACMP: Balancing Hardware Efficiency and Programmer Efficiency,” HPS Technical Report 2007. pdf
Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
Suleman, “An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs,” PhD thesis 2010. pdf

Lecture 7

Optional:

Lefurgy et al., “Energy Management for Commercial Servers,” IEEE Computer 2003. pdf
Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009. pdf
Lee et al., “Phase-Change Technology and the Future of Main Memory,” IEEE Micro 2010. pdf
Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. pdf
Dhiman et al., “PDRAM: a hybrid PRAM and DRAM main memory system,” DAC 2009. pdf
Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf

Lecture 8

Optional:

Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
Chakraborty et al., “Computation Spreading: Employing Hardware Migration to Specialize CMP Cores on-the-fly,” ASPLOS 2006. pdf
Rangan et al., “Thread Motion: Fine-Grained Power Management for Multi-Core Systems,” ISCA 2009. pdf

Lecture 9

Required:

Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf

Recommended:

Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf

Optional:

Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf
Ebrahimi et al., “Parallel Application Memory Scheduling,” MICRO 2011. pdf
Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf
Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964. pdf
McNairy and Bhatia, “Montecito: A Dual-Core, Dual-Thread Itanium Processor,” IEEE Micro 2005. pdf

Lecture 10

Required:

Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf

Recommended:

Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf

Optional:

Yamamoto et al., “Performance Estimation of Multistreamed, Superscalar Processors,” HICSS 1994. pdf
Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995. pdf
Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
Jacobsen et al., “Assigning confidence to conditional branch predictions,” MICRO 1996. pdf
Brown and Tullsen, “Handling Long-latency Loads in a Simultaneous Multithreading Processor,” MICRO 2001. pdf
El-Moursy and Albonesi, “Front-End Policies for Improved Issue Efficiency in SMT Processors,” HPCA 2003. pdf
Raasch and Reinhardt, “The Impact of Resource Partitioning on SMT Processors,” PACT 2003. pdf
Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf
Ramirez et al., “Runahead Threads to Improve SMT Performance,” HPCA 2008. pdf
Van Craeynest et al., “MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor,” HiPEAC 2009. pdf
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
Lebeck et al., “A Large, Fast Instruction Window for Tolerating Cache Misses,” ISCA 2002. pdf
Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal 2002. pdf

Lecture 11

Optional:

Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. pdf
Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. pdf
Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. pdf
Dusser et al., “Zero-Content Augmented Caches,” ICS 2009. pdf
Islam and Stenstrom, “Zero-Value Caches: Cancelling Loads that Return Zero,” PACT 2009. pdf
Yang et al., “Frequent Value Compression in Data Caches,” MICRO 2000. pdf
Alameldeen and Wood, “Adaptive Cache Compression for High-Performance Processors,” ISCA 2004. pdf
Thoziyoor et al., “A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies,” ISCA 2008. pdf
Ekman and Stenstrom, “A Robust Main-Memory Compression Scheme,” ISCA 2005. pdf
Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012. pdf
Ubal et al., “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” PACT 2012. pdf
Chen et al., “C-Pack: A High-Performance Microprocessor Cache Compression Algorithm,” IEEE TVLSI 2010. pdf
Magnusson et al., “Simics: A full system simulation platform,” IEEE Computer 2002. pdf
Tremaine et al., “Pinnacle: IBM MXT in a memory controller chip,” IEEE Micro 2001. pdf

Lecture 12

Optional:

Johnson and Hwu, “Run-Time Adaptive Cache Hierarchy Management via Reference Analysis,” ISCA 1997. pdf
Piquet et al., “Exploiting single-usage for effective memory management,” ACSAC 2007. pdf
Wu et al., “SHiP: Signature-based hit predictor for high performance caching,” MICRO 2011. pdf
Qureshi et al., “Adaptive insertion policies for high performance caching,” ISCA 2007. pdf
Jaleel et al., “Adaptive insertion policies for managing shared caches,” PACT 2008. pdf
Jaleel et al., “High performance cache replacement using re-reference interval prediction,” ISCA 2010. pdf
Xie and Loh, “PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches,” ISCA 2009. pdf
Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. pdf
Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. pdf

Lecture 13

Optional:

Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000. pdf
Rotenberg, “AR-SMT: a microarchitectural approach to fault tolerance in microprocessors,” Fault-Tolerant Computing 1999. pdf
Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002. pdf
Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999. pdf
Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005. pdf
Zilles et al., “The use of multithreading for exception handling,” MICRO 1999. pdf
Dubois and Song, “Assisted Execution,” USC Tech Report 1998. pdf
Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999. pdf
Chappell et al., “Difficult-path branch prediction using subordinate microthreads,” ISCA 2002. pdf
Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001. pdf

Lecture 15

Required:

Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf

Recommended:

Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf

Optional:

Luk, “Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors,” ISCA 2001. pdf
Sundaramoorthy et al., “Slipstream Processors: Improving both Performance and Fault Tolerance,” ASPLOS 2000. pdf
Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” PACT 2005. pdf
Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
Gopal et al., “Speculative Versioning Cache,” HPCA 1998. pdf
Franklin and Sohi, “The expandable split window paradigm for exploiting fine-grain parallelism,” ISCA 1992. pdf

Lecture 16

Required:

Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf

Recommended:

Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf

Optional:

Franklin and Sohi, “ARB: A hardware mechanism for dynamic reordering of memory references,” IEEE TC 1996. pdf
Vijaykumar and Sohi, “Task selection for a multiscalar processor,” MICRO 1998. pdf
Moshovos et al., “Dynamic Speculation and Synchronization of Data Dependences,” ISCA 1997. pdf
Chrysos and Emer, “Memory Dependence Prediction using Store Sets,” ISCA 1998. pdf
Martinez and Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002. pdf
Rajwar and Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
Shavit and Touitou, “Software transactional memory,” PODC 1995. pdf

Lecture 17

Required:

Dally, “Virtual Channel Flow Control,” ISCA 1990. pdf
Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. pdf
Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. pdf
Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. pdf
Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. pdf

Recommended:

Fallin et al., “CHIPPER: A Low-Complexity, Bufferless Deflection Router,” HPCA 2011. pdf
Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012. pdf
Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. pdf

Optional:

Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993. pdf
Das et al., “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs,” HPCA 2009. pdf
Seitz, “The Cosmic Cube,” CACM 1985. pdf
Gottlieb et al., “The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine,” ISCA 1982. pdf

Lecture 18

Required:

Dally, “Virtual Channel Flow Control,” ISCA 1990. pdf
Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. pdf
Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. pdf
Fallin et al., “CHIPPER: A Low-Complexity, Bufferless Deflection Router,” HPCA 2011. pdf
Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012. pdf
Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. pdf

Recommended:

Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. pdf
Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. pdf
Chang et al., “HAT: Heterogeneous Adaptive Throttling for On-Chip Networks,” SBAC-PAD 2012. pdf

Optional:

Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. pdf
Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro 1997. pdf

Lecture 20

Optional:

Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf
Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
Martinez and Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002. pdf
Rajwar and Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002. pdf
Shavit and Touitou, “Software transactional memory,” PODC 1995. pdf
Dice et al., “Early experience with a commercial hardware transactional memory implementation,” ASPLOS 2009. pdf
Wang et al., “Evaluation of Blue Gene/Q hardware support for transactional memories,” PACT 2012. pdf
Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. pdf

Lecture 21

Optional:

Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. pdf
Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. pdf
Das et al., “Application-aware prioritization mechanisms for on-chip networks,” MICRO 2009. pdf
Das et al., “Aérgia: exploiting packet latency slack in on-chip networks,” ISCA 2010. pdf
Grot et al., “Express Cube Topologies for On-Chip Interconnects,” HPCA 2009. pdf
Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011. pdf
Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. pdf

Lecture 22

Optional:

Gurd et al., “The Manchester prototype dataflow computer,” CACM 1985. pdf
Lee and Hurson, “Dataflow Architectures and Multithreading,” IEEE Computer 1994. pdf
Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985. pdf
Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. pdf
Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. pdf
Dennis and Misunas, “A preliminary architecture for a basic data flow processor,” ISCA 1974. pdf
Treleaven et al., “Data-Driven and Demand-Driven Computer Architecture,” ACM Computing Surveys 1982. pdf
Veen, “Dataflow Machine Architecture,” ACM Computing Surveys 1986. pdf
Arvind and Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” IEEE TC 1990. pdf
Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986. pdf

Lecture 23

Optional:

Sakai et al., “An Architecture of a Dataflow Single Chip Processor,” ISCA 1989. pdf
Patt et al., “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985. pdf
Colwell, “The Pentium Chronicles,” Wiley-IEEE Computer Society Press 2005.
Kung, “Why Systolic Architectures?,” IEEE Computer 1982. pdf
Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986. pdf
Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987. pdf

Lecture 24

Required:

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf
Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. pdf
Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. pdf
Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches,” ISCA 2009. pdf

Recommended:

Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. pdf
Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. pdf
Kim et al., “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches,” ASPLOS 2002. pdf
Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007. pdf
Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. pdf

Optional:

Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. pdf
Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. pdf

Lecture 25

Required:

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf

Recommended:

Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. pdf
Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. pdf

Optional:

Moscibroda and Mutlu, “Distributed order scheduling and its application to multi-core DRAM controllers,” PODC 2008. pdf
Waldspurger and Weihl, “Lottery scheduling: flexible proportional-share resource management,” OSDI 1994. pdf

Lecture 26

Required:

Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. pdf
Ebrahimi et al., “Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems,” ASPLOS 2010. pdf
Subramanian et al., “MISE: Providing Performance Predictability in Shared Main Memory Systems,” HPCA 2013.

Recommended:

Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” MICRO 2010. pdf
Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” ISCA 2008. pdf
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. pdf
Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf

Lecture 27

Required:

Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. pdf
Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” HPCA 2009. pdf

Recommended:

Rixner et al., “Memory Access Scheduling,” ISCA 2000. pdf
Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. pdf
Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007. pdf
Zhuang and Lee, “A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches,” ICPP 2003. pdf
Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. pdf
