Transient-Fault Recovery via Simultaneous Multithreading
Tuesday December 10, 2002
Hamerschlag Hall D-210
4:00 p.m.
T. N. Vijaykumar
Assistant Professor of ECE, Purdue University
To address the increasing susceptibility of commodity microprocessors
to transient faults, we propose a scheme for transient-fault recovery
called Simultaneously and Redundantly Threaded processors with Recovery
(SRTR) that enhances a previously proposed scheme for transient-fault
detection, called Simultaneously and Redundantly Threaded (SRT) processors.
SRT replicates an application into two communicating threads, one executing
ahead of the other. The trailing thread repeats the computation performed
by the leading thread, and the values produced by the two threads are
compared. In SRT, a leading instruction may commit before the check for
faults occurs, relying on the trailing thread to trigger detection. In
contrast, SRTR must not allow any leading instruction to commit before
checking occurs, since a faulty instruction cannot be undone once it
commits. To avoid stalling leading instructions at commit
while waiting for their trailing counterparts, SRTR exploits the time
between the completion and commit of leading instructions. SRTR compares
the leading and trailing values as soon as the trailing instruction completes,
typically before the leading instruction reaches the commit point. To
avoid increasing the bandwidth demand on the register file for checking
register values, SRTR uses the register value queue (RVQ) to hold register
values for checking. To reduce the bandwidth pressure on the RVQ itself,
SRTR employs dependence-based checking elision (DBCE). Because faults
propagate through dependent instructions, DBCE exploits register (true)
dependence chains so that only the last instruction in a chain uses the
RVQ and has its leading and trailing values checked. SRTR performs
within 1% and 7% of SRT for SPEC95 integer and floating-point programs,
respectively. Without DBCE, SRTR incurs about 18% performance loss when
the number of RVQ ports is reduced from four (which is performance-equivalent
to an unlimited number) to two; with DBCE, a two-ported RVQ performs
within 2% of a four-ported RVQ.
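
As a rough illustration of the checking idea described in the abstract, the following Python sketch models a leading and a trailing execution of a tiny program and compares results only at the ends of register dependence chains. It is a toy model under stated assumptions, not the SRTR hardware design: the names Instr, execute, chain_ends, and check_with_rvq are hypothetical, the program is add-only so that an injected fault always propagates to its consumers, and "RVQ uses" is simply a count of compared values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical add-only instruction: dest = sum(srcs) + imm."""
    dest: str
    srcs: Tuple[str, ...]
    imm: int = 0


def execute(program: List[Instr], regs: Dict[str, int],
            fault_at: Optional[int] = None) -> List[int]:
    """Run the program and return the value produced by each instruction.

    If fault_at is given, the low bit of that instruction's result is
    flipped to imitate a transient fault (an assumption of this sketch).
    """
    regs = dict(regs)  # private copy, one register file per thread
    results = []
    for i, ins in enumerate(program):
        value = sum(regs[s] for s in ins.srcs) + ins.imm
        if i == fault_at:
            value ^= 1  # inject a single-bit fault
        regs[ins.dest] = value
        results.append(value)
    return results


def chain_ends(program: List[Instr]) -> List[int]:
    """Indices of instructions whose result no later instruction reads
    before the destination register is overwritten."""
    ends = []
    for i, ins in enumerate(program):
        consumed = False
        for later in program[i + 1:]:
            if ins.dest in later.srcs:
                consumed = True  # a dependent instruction extends the chain
                break
            if later.dest == ins.dest:
                break  # value overwritten without being read
        if not consumed:
            ends.append(i)
    return ends


def check_with_rvq(program: List[Instr], regs: Dict[str, int],
                   fault_at: Optional[int], use_dbce: bool) -> Tuple[bool, int]:
    """Compare leading and trailing values; return (fault detected, RVQ uses)."""
    leading = execute(program, regs, fault_at)   # may carry the fault
    trailing = execute(program, regs)            # redundant, fault-free copy
    to_check = chain_ends(program) if use_dbce else list(range(len(program)))
    detected = any(leading[i] != trailing[i] for i in to_check)
    return detected, len(to_check)


# Example: r1 -> r2 -> r3 form one dependence chain; r4 is independent.
PROGRAM = [
    Instr("r1", ("r0",), 1),
    Instr("r2", ("r1",), 2),
    Instr("r3", ("r2", "r1")),
    Instr("r4", ("r0",), 5),
]
REGS = {"r0": 10}
```

In this example, checking every instruction compares four values, while checking only chain ends compares two (the instructions writing r3 and r4), and a fault injected into any of the four instructions is still detected. A real design must also handle instructions that can mask a fault, which this add-only sketch deliberately avoids.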
T. N. Vijaykumar joined the faculty of the School of Electrical and
Computer Engineering at Purdue University in 1998 after completing his Ph.D.
at the University of Wisconsin-Madison. His research interests are in
computer architecture, microarchitecture, low power, and fault tolerance.