Link to CALCM Home  


Transient-fault recovery via Simultaneous Multithreading

Tuesday December 10, 2002
Hamerschlag Hall D-210
4:00 p.m.

T. N. Vijaykumar

Assistant Professor of ECE, Purdue University

To address the increasing susceptibility of commodity microprocessors to transient faults, we propose a scheme for transient-fault recovery called Simultaneously and Redundantly Threaded processors with Recovery (SRTR) that enhances a previously proposed scheme for transient-fault detection, called Simultaneously and Redundantly Threaded (SRT) processors. SRT replicates an application into two communicating threads, one executing ahead of the other. The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared. In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection. In contrast, SRTR must not allow any leading instruction to commit before checking occurs, since a faulty instruction cannot be undone once the instruction commits. To avoid stalling leading instructions at commit while waiting for their trailing counterparts, SRTR exploits the time between the completion and commit of leading instructions. SRTR compares the leading and trailing values as soon as the trailing instruction completes, typically before the leading instruction reaches the commit point. To avoid increasing the bandwidth demand on the register file for checking register values, SRTR uses the register value queue (RVQ) to hold register values for checking. To reduce the bandwidth pressure on the RVQ itself, SRTR employs dependence-based checking elision (DBCE). By reasoning that faults propagate through dependent instructions, DBCE exploits register (true) dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked. SRTR performs within 1% and 7% of SRT for SPEC95 integer and floating-point programs, respectively. While SRTR without DBCE incurs about 18% performance loss when the number of RVQ ports is reduced from four (which is performance-equivalent to an unlimited number) to two ports, with DBCE, a two-ported RVQ performs within 2% of a four-ported RVQ.

T. N. Vijaykumar joined the faculty of the School of Electrical and Computer Engineering in 1998 after completing his Ph.D. at the University of Wisconsin-Madison. His research interests are in computer architecture, microarchitecture, low power, and fault tolerance.



Department of Electrical and Computer EngineeringCarnegie Mellon UniversitySchool of Computer Science