Runahead Execution:
A Power-efficient Mechanism for Tolerating Long Main Memory Latencies
Tuesday May 30, 2006
Hamerschlag Hall 1112
4:00 pm
Onur Mutlu
University of Texas at Austin
High-performance processors tolerate memory latency using out-of-order
execution. Unfortunately, today's processors face memory latencies on
the order of hundreds of cycles. To tolerate such long latencies,
out-of-order execution requires an instruction window that is
unreasonably large in terms of design complexity, hardware cost, and
power consumption. As a result, current processors spend most of their
execution time stalling, waiting for long-latency cache misses to
return from main memory. The problem is getting worse because memory
latencies keep increasing in terms of processor cycles.
In this talk, I will first present runahead execution, a mechanism that
provides the memory latency tolerance benefits of a large instruction
window without requiring large, complex, and power-hungry hardware
structures. Instead of stalling while a long-latency cache miss is in
progress, a runahead execution processor continues to perform useful
computation. Runahead execution unblocks the instruction window blocked
by the miss, allowing the processor to execute far ahead on the program
path. As a result, other independent long-latency cache misses are
discovered and their data is prefetched into caches long before it is
needed. Our evaluations show that runahead execution on a processor
with a 128-entry instruction window achieves the performance of a
processor with three times the instruction window size.
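
To make the control flow concrete, here is a minimal C sketch of
runahead mode operating on a toy instruction stream. The structures and
names (cpu_t, inst_t, the INV bits, the specific addresses) are
illustrative assumptions for this sketch, not the exact hardware design
presented in the talk.

    #include <stdbool.h>
    #include <stdio.h>

    #define NREGS 8

    typedef struct {
        long regs[NREGS];
        bool inv[NREGS];   /* INV bit: register holds bogus, miss-dependent data */
        bool runahead;     /* currently in speculative runahead mode? */
    } cpu_t;

    typedef struct { int dst, src; unsigned long addr; bool l2_miss; } inst_t;

    /* Toy instruction stream: loads, some of which miss in the L2 cache. */
    static inst_t prog[] = {
        {0, -1, 0x1000, true},   /* load r0: L2 miss -> triggers runahead */
        {1,  0, 0x2000, false},  /* load r1: address depends on INV r0 */
        {2, -1, 0x3000, true},   /* independent miss: prefetched in runahead */
        {3, -1, 0x4000, true},   /* independent miss: prefetched in runahead */
    };

    int main(void) {
        cpu_t cpu = {{0}}, checkpoint = cpu;

        for (size_t i = 0; i < sizeof prog / sizeof prog[0]; i++) {
            inst_t *in = &prog[i];

            if (in->l2_miss && !cpu.runahead) {
                /* The miss would block the instruction window: checkpoint
                 * architectural state, mark the destination INV, and keep
                 * fetching and executing instead of stalling. */
                checkpoint = cpu;
                cpu.runahead = true;
                cpu.inv[in->dst] = true;
                printf("inst %zu: L2 miss, entering runahead\n", i);
            } else if (cpu.runahead) {
                if (in->src >= 0 && cpu.inv[in->src]) {
                    /* Source is INV, so the result is INV too: dependent
                     * misses cannot be prefetched by plain runahead. */
                    cpu.inv[in->dst] = true;
                } else if (in->l2_miss) {
                    /* Independent miss with a valid address: issue the
                     * prefetch now, long before the data is needed. */
                    printf("inst %zu: runahead prefetch of 0x%lx\n", i, in->addr);
                    cpu.inv[in->dst] = true;
                }
                /* Results produced in runahead mode are never committed. */
            }
        }

        /* The original miss returns: discard speculative state, restore
         * the checkpoint, and re-execute; prefetched lines now hit. */
        cpu = checkpoint;
        cpu.runahead = false;
        printf("miss data returned: checkpoint restored, normal mode resumed\n");
        return 0;
    }

Note how the second load produces an INV result rather than a prefetch:
its address depends on the missing data. This is exactly the dependent-miss
limitation that the second part of the talk addresses.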
In the second part of my talk, I will explain the two major limitations
of runahead execution: its energy inefficiency and its inability to
parallelize dependent cache misses. I will briefly touch on the
solutions we developed to make runahead execution energy efficient.
Then, I will describe an efficient hardware technique, called
address-value delta (AVD) prediction, which predicts the values of
pointer load instructions encountered during runahead execution in order to
enable the parallelization of dependent cache misses, which are common
in programs that employ linked data structures. I will provide an
analysis of the high-level programming constructs that result in
AVD-predictable load instructions.
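
As a rough illustration of the idea, the C sketch below implements a
small AVD predictor table and exercises it on a hypothetical linked-list
traversal whose nodes sit at a fixed stride in memory. The table size,
confidence threshold, maximum tracked delta, and the example addresses
are assumed parameters for this sketch, not the design evaluated in the
talk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define AVD_ENTRIES 64
    #define CONF_THRESH 2
    #define MAX_AVD     64   /* only small deltas tend to be stable/predictable */

    typedef struct {
        uintptr_t pc;      /* load instruction address this entry tracks */
        intptr_t  delta;   /* last observed (address - value) */
        int       conf;    /* confidence counter */
    } avd_entry_t;

    static avd_entry_t table[AVD_ENTRIES];

    /* Train on a completed pointer load: record its address-value delta. */
    static void avd_train(uintptr_t pc, uintptr_t addr, uintptr_t value) {
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        intptr_t delta = (intptr_t)(addr - value);
        if (delta > MAX_AVD || delta < -MAX_AVD) { e->conf = 0; return; }
        if (e->pc == pc && e->delta == delta) {
            if (e->conf < CONF_THRESH) e->conf++;   /* delta is stable */
        } else {
            e->pc = pc; e->delta = delta; e->conf = 0;
        }
    }

    /* On a pointer-load L2 miss during runahead: predict the value so the
     * dependent load's address can be computed and its miss issued now. */
    static bool avd_predict(uintptr_t pc, uintptr_t addr, uintptr_t *value) {
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        if (e->pc == pc && e->conf >= CONF_THRESH) {
            *value = addr - (uintptr_t)e->delta;   /* value = address - AVD */
            return true;
        }
        return false;
    }

    int main(void) {
        /* Hypothetical traversal of a linked list whose nodes were
         * allocated at a fixed 0x40-byte stride, with the 'next' field
         * at offset 8: the load's AVD is a constant -0x38. */
        uintptr_t pc = 0x400123;   /* hypothetical PC of the 'next' load */
        uintptr_t addrs[]  = {0x10008, 0x10048, 0x10088, 0x100c8, 0x10108};
        uintptr_t values[] = {0x10040, 0x10080, 0x100c0, 0x10100, 0x10140};

        for (int i = 0; i < 5; i++) {
            uintptr_t predicted;
            if (avd_predict(pc, addrs[i], &predicted))
                printf("load @%#lx: predicted value %#lx (actual %#lx)\n",
                       (unsigned long)addrs[i], (unsigned long)predicted,
                       (unsigned long)values[i]);
            avd_train(pc, addrs[i], values[i]);
        }
        return 0;
    }

With a predicted value in hand, the dependent pointer load's address
becomes computable during runahead, so its cache miss can be serviced in
parallel with the current one instead of being serialized behind it.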
Onur Mutlu is a PhD candidate at the University of Texas at Austin. His
PhD research has been in computer architecture with a focus on
high-performance energy-efficient computer architectures, novel latency
tolerance and branch processing techniques, and
programmer-compiler-microarchitecture interaction. He received a BSE in
computer engineering and a BS in psychology from the University of
Michigan and an MSE in computer engineering from UT-Austin. Onur worked
at Intel Corporation during the summers of 2001-2003 and at Advanced
Micro Devices during the summers of 2004-2005. He received the Intel
PhD Fellowship in 2004 and the University of Texas George H. Mitchell
Award for Excellence in Graduate Research in 2005.