
Transient Fault Detection and Recovery in Superscalar Processors

Tuesday, April 9, 2002
Hamerschlag Hall 1112
4:00 p.m.

Joydeep Ray
M.S. Student, Department of ECE, Carnegie Mellon

In an effort to keep up with Moore's Law, microprocessor implementations have required ever-decreasing feature sizes and supply voltages. As a consequence of the reduced capacitive node charge and noise margin, even flip-flop circuits will inevitably become susceptible to soft errors. The high clock rates of modern processors further exacerbate the problem by increasing the probability of a new failure mechanism in which a momentarily corrupted signal is latched by a flip-flop. These necessary evils of continually pushing the processor performance envelope will shortly place us in an unfamiliar realm where logically correct implementations alone cannot ensure correct program execution with sufficient confidence. In this talk, we propose a fault-tolerant extension for modern superscalar out-of-order datapaths. We argue that a single processor that can selectively deliver fault tolerance when required, and can otherwise revert to full performance, will be an important design point in the transitional phase when transient failure rates are just becoming unacceptable for some applications. This dual functionality will enable both vendors and end users to trade off performance for fault protection depending on the application. In the proposed extensions, error detection is achieved by verifying the redundant results of dynamically replicated threads of execution, while the error-recovery scheme employs the instruction-rewind mechanism to restart at a failed instruction. The proposed scheme can be supported by only modest additional hardware. Nevertheless, it delivers performance and fault coverage comparable to those of static, fully replicated solutions. Simulation results with 11 SPEC95 and SPEC2000 benchmarks show that, in the absence of faults, error detection costs a performance loss of 32% on average, which is comparable to previously proposed schemes. In the presence of faults, the fast error-recovery scheme contributes very little additional slowdown.
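
As a rough illustration of the detect-and-rewind idea in the abstract, the following toy Python sketch executes every instruction twice, compares the two results before committing, and retries the instruction on a mismatch. It is a software analogy only, not the hardware design presented in the talk: the instruction format, the `run_redundant` function, and the `faults` set used to inject single-bit upsets are all assumptions made up for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical instruction format: (destination register, operation, sources).
Instr = Tuple[str, Callable[..., int], Tuple[str, ...]]


def run_redundant(program: List[Instr],
                  regs: Dict[str, int],
                  faults: Set[Tuple[int, int, int]]) -> int:
    """Run `program`, executing each instruction twice and comparing results.

    `faults` holds (pc, copy, attempt) triples; a matching execution has the
    low bit of its result flipped to mimic a transient fault.
    Returns the number of rewinds performed.
    """
    rewinds = 0
    pc = 0
    attempt = 0
    while pc < len(program):
        dest, op, srcs = program[pc]
        results = []
        for copy in (0, 1):
            value = op(*(regs[s] for s in srcs))
            if (pc, copy, attempt) in faults:
                value ^= 1  # injected single-bit upset (assumed fault model)
            results.append(value)
        if results[0] == results[1]:
            regs[dest] = results[0]  # copies agree: commit and advance
            pc += 1
            attempt = 0
        else:
            rewinds += 1  # mismatch: discard both results, retry at same pc
            attempt += 1
    return rewinds


program = [
    ("r2", lambda a, b: a + b, ("r0", "r1")),
    ("r3", lambda a: a * 2, ("r2",)),
]
regs = {"r0": 3, "r1": 4}
rewinds = run_redundant(program, regs, faults={(1, 0, 0)})
print(regs["r3"], rewinds)  # prints "14 1"
```

Because the injected fault is transient, the retry of the second instruction sees two matching results and commits normally. Note that a comparison of this kind cannot detect a fault that corrupts both copies in the same way.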

Joydeep Ray is presently a second-year graduate student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. He received his B.S. from the Indian Institute of Technology, Kharagpur. His research interests include fault-tolerant processor architecture and FPGA prototyping of microarchitectures. In this project, he was supervised by Prof. James C. Hoe and Prof. Babak Falsafi.


Department of Electrical and Computer Engineering | Carnegie Mellon University | School of Computer Science