Runahead Execution:
A Power-efficient Mechanism for Tolerating Long Main Memory Latencies
Tuesday May 30, 2006
Hamerschlag Hall 1112
4:00 pm
Onur Mutlu
University of Texas at Austin
High-performance processors tolerate memory latency using out-of-order
execution. Unfortunately, today's processors face memory latencies on
the order of hundreds of cycles. To tolerate such long latencies,
out-of-order execution requires an instruction window that is
unreasonably large in terms of design complexity, hardware cost, and
power consumption. As a result, current processors spend most of their
execution time stalling, waiting for long-latency cache misses to
return from main memory. The problem is getting worse because memory
latencies keep increasing in terms of processor cycles.
In this talk, I will first present runahead execution, a mechanism that
provides the memory latency tolerance benefits of a large instruction
window without requiring large, complex, and power-hungry hardware
structures. Instead of stalling while a long-latency cache miss is in
progress, a runahead execution processor continues to perform useful
computation. Runahead execution unblocks the instruction window blocked
by the miss, allowing the processor to execute far ahead on the program
path. As a result, other independent long-latency cache misses are
discovered and their data is prefetched into caches long before it is
needed. Our evaluations show that runahead execution on a processor
with a 128-entry instruction window achieves the performance of a
processor with three times the instruction window size.
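
To make the control flow concrete, here is a minimal C sketch of
runahead mode operating on a toy instruction stream. The structures and
names (cpu_t, inst_t, the INV bits, the specific addresses) are
illustrative assumptions for this sketch, not the exact hardware design
presented in the talk.

    #include <stdbool.h>
    #include <stdio.h>

    #define NREGS 8

    typedef struct {
        long regs[NREGS];
        bool inv[NREGS];   /* INV bit: register holds bogus, miss-dependent data */
        bool runahead;     /* currently in speculative runahead mode? */
    } cpu_t;

    typedef struct { int dst, src; unsigned long addr; bool l2_miss; } inst_t;

    /* Toy instruction stream: loads, some of which miss in the L2 cache. */
    static inst_t prog[] = {
        {0, -1, 0x1000, true},   /* load r0: L2 miss -> triggers runahead */
        {1,  0, 0x2000, false},  /* load r1: address depends on INV r0 */
        {2, -1, 0x3000, true},   /* independent miss: prefetched in runahead */
        {3, -1, 0x4000, true},   /* independent miss: prefetched in runahead */
    };

    int main(void) {
        cpu_t cpu = {{0}}, checkpoint = cpu;

        for (size_t i = 0; i < sizeof prog / sizeof prog[0]; i++) {
            inst_t *in = &prog[i];

            if (in->l2_miss && !cpu.runahead) {
                /* The miss would block the instruction window: checkpoint
                 * architectural state, mark the destination INV, and keep
                 * fetching and executing instead of stalling. */
                checkpoint = cpu;
                cpu.runahead = true;
                cpu.inv[in->dst] = true;
                printf("inst %zu: L2 miss, entering runahead\n", i);
            } else if (cpu.runahead) {
                if (in->src >= 0 && cpu.inv[in->src]) {
                    /* Source is INV, so the result is INV too: dependent
                     * misses cannot be prefetched by plain runahead. */
                    cpu.inv[in->dst] = true;
                } else if (in->l2_miss) {
                    /* Independent miss with a valid address: issue the
                     * prefetch now, long before the data is needed. */
                    printf("inst %zu: runahead prefetch of 0x%lx\n", i, in->addr);
                    cpu.inv[in->dst] = true;
                }
                /* Results produced in runahead mode are never committed. */
            }
        }

        /* The original miss returns: discard speculative state, restore
         * the checkpoint, and re-execute; prefetched lines now hit. */
        cpu = checkpoint;
        cpu.runahead = false;
        printf("miss data returned: checkpoint restored, normal mode resumed\n");
        return 0;
    }

Note how the second load produces an INV result rather than a prefetch:
its address depends on the missing data. This is exactly the dependent-miss
limitation that the second part of the talk addresses.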
In the second part of my talk, I will explain the two major limitations
of runahead execution: its energy inefficiency and its inability to
parallelize dependent cache misses. I will briefly touch on the
solutions we developed to make runahead execution energy efficient.
Then, I will describe an efficient hardware technique, called
address-value delta (AVD) prediction, which predicts the values of
pointer load instructions encountered during runahead execution in order to
enable the parallelization of dependent cache misses, which are common
in programs that employ linked data structures. I will provide an
analysis of the high-level programming constructs that result in
AVD-predictable load instructions.
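
As a rough illustration of the idea, the C sketch below implements a
small AVD predictor table and exercises it on a hypothetical linked-list
traversal whose nodes sit at a fixed stride in memory. The table size,
confidence threshold, maximum tracked delta, and the example addresses
are assumed parameters for this sketch, not the design evaluated in the
talk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define AVD_ENTRIES 64
    #define CONF_THRESH 2
    #define MAX_AVD     64   /* only small deltas tend to be stable/predictable */

    typedef struct {
        uintptr_t pc;      /* load instruction address this entry tracks */
        intptr_t  delta;   /* last observed (address - value) */
        int       conf;    /* confidence counter */
    } avd_entry_t;

    static avd_entry_t table[AVD_ENTRIES];

    /* Train on a completed pointer load: record its address-value delta. */
    static void avd_train(uintptr_t pc, uintptr_t addr, uintptr_t value) {
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        intptr_t delta = (intptr_t)(addr - value);
        if (delta > MAX_AVD || delta < -MAX_AVD) { e->conf = 0; return; }
        if (e->pc == pc && e->delta == delta) {
            if (e->conf < CONF_THRESH) e->conf++;   /* delta is stable */
        } else {
            e->pc = pc; e->delta = delta; e->conf = 0;
        }
    }

    /* On a pointer-load L2 miss during runahead: predict the value so the
     * dependent load's address can be computed and its miss issued now. */
    static bool avd_predict(uintptr_t pc, uintptr_t addr, uintptr_t *value) {
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        if (e->pc == pc && e->conf >= CONF_THRESH) {
            *value = addr - (uintptr_t)e->delta;   /* value = address - AVD */
            return true;
        }
        return false;
    }

    int main(void) {
        /* Hypothetical traversal of a linked list whose nodes were
         * allocated at a fixed 0x40-byte stride, with the 'next' field
         * at offset 8: the load's AVD is a constant -0x38. */
        uintptr_t pc = 0x400123;   /* hypothetical PC of the 'next' load */
        uintptr_t addrs[]  = {0x10008, 0x10048, 0x10088, 0x100c8, 0x10108};
        uintptr_t values[] = {0x10040, 0x10080, 0x100c0, 0x10100, 0x10140};

        for (int i = 0; i < 5; i++) {
            uintptr_t predicted;
            if (avd_predict(pc, addrs[i], &predicted))
                printf("load @%#lx: predicted value %#lx (actual %#lx)\n",
                       (unsigned long)addrs[i], (unsigned long)predicted,
                       (unsigned long)values[i]);
            avd_train(pc, addrs[i], values[i]);
        }
        return 0;
    }

With a predicted value in hand, the dependent pointer load's address
becomes computable during runahead, so its cache miss can be serviced in
parallel with the current one instead of being serialized behind it.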
Onur Mutlu is a PhD candidate at the University of Texas at Austin. His
PhD research has been in computer architecture with a focus on
high-performance energy-efficient computer architectures, novel latency
tolerance and branch processing techniques, and
programmer-compiler-microarchitecture interaction. He received a BSE in
computer engineering and a BS in psychology from the University of
Michigan and an MSE in computer engineering from UT-Austin. Onur worked
at Intel Corporation during the summers of 2001-2003 and at Advanced
Micro Devices during the summers of 2004-2005. He received the Intel
PhD Fellowship in 2004 and the University of Texas George H. Mitchell
Award for Excellence in Graduate Research in 2005.