Transient Fault Detection
and Recovery in Superscalar Processors
April 9, 2002 Tuesday
Hamerschlag Hall 1112
4:00 p.m.
Joydeep Ray
M.S. Student, Department of ECE, Carnegie Mellon
In an effort to keep up with Moore's Law, microprocessor implementations
have required ever-decreasing feature sizes and supply voltages. As
a consequence of the reduced capacitive node charge and noise margin,
even flip-flop circuits will inevitably become susceptible to soft errors.
The high clock rate of modern processors further exacerbates the
problem by increasing the probability of a new failure mechanism
where a momentarily corrupted signal is latched by a flip-flop.
These necessary evils of continually pushing the processor performance
envelope will shortly place us in an unfamiliar realm where logically
correct implementations alone cannot ensure correct program execution
with sufficient confidence. In this talk, we propose a fault-tolerant
extension for modern superscalar out-of-order datapaths. We argue
that a single processor that can selectively deliver fault-tolerance
when required and can otherwise revert to full performance will
be an important design point in the transitional phase when transient
failure rates are just becoming unacceptable for some applications.
This dual functionality will enable both vendors and end users to
trade off performance for fault protection, depending on the application.
In the proposed extensions, error detection is achieved by verifying
the redundant results of dynamically replicated threads of execution,
while the error-recovery scheme employs the instruction-rewind mechanism
to restart at a failed instruction. The proposed scheme can be supported
by only modest additional hardware. Nevertheless, it delivers performance
and fault coverage comparable to those of static, fully replicated solutions.
Simulation results with 11 SPEC95 and SPEC2000 benchmarks show that
in the absence of faults, error detection costs a performance loss
of 32% on average, which is comparable to that of previously proposed
schemes. In the presence of faults, the fast error recovery scheme
contributes very little additional slowdown.
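As a rough software illustration of the detect-and-rewind idea described above (not the hardware design presented in the talk), the Python sketch below executes every instruction twice, compares the two results before committing, and re-executes the instruction on a mismatch. All names (`alu`, `run_redundant`, the `faults` schedule, and the three-instruction toy program) are assumptions made for this example. The model also assumes that a transient fault corrupts only one of the two copies and does not recur when the instruction is re-executed.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers (an assumption of this sketch)


def alu(op, a, b):
    """Compute one result for a toy three-operation instruction set."""
    if op == "add":
        return (a + b) & MASK
    if op == "sub":
        return (a - b) & MASK
    if op == "mul":
        return (a * b) & MASK
    raise ValueError(f"unknown op: {op}")


def run_redundant(program, regs, faults=None):
    """Run `program` with each instruction executed twice.

    `faults` is a hypothetical injection schedule: it maps an instruction
    index to the bit flipped in the second copy on the first attempt only,
    which models a transient (non-recurring) fault.
    Returns the final register values and the number of rewinds.
    """
    regs = dict(regs)              # do not mutate the caller's registers
    pending = dict(faults or {})
    pc = 0
    rewinds = 0
    while pc < len(program):
        op, dst, src1, src2 = program[pc]
        first = alu(op, regs[src1], regs[src2])
        second = alu(op, regs[src1], regs[src2])
        if pc in pending:
            second ^= 1 << pending.pop(pc)   # inject the fault exactly once
        if first != second:
            # Detection: the redundant results disagree, so nothing commits.
            # Recovery: leave pc unchanged and re-execute this instruction.
            rewinds += 1
            continue
        regs[dst] = first          # results agree: commit and advance
        pc += 1
    return regs, rewinds


program = [
    ("add", "r2", "r0", "r1"),
    ("mul", "r3", "r2", "r2"),
    ("sub", "r4", "r3", "r0"),
]
initial = {"r0": 3, "r1": 4, "r2": 0, "r3": 0, "r4": 0}

clean_regs, clean_rewinds = run_redundant(program, initial)
faulty_regs, faulty_rewinds = run_redundant(program, initial, faults={1: 5})
```

In this toy run, the injected bit flip at instruction 1 is caught at the compare step and costs one re-execution, and the final register state matches the fault-free run. The sketch says nothing about the performance overheads reported in the abstract, which depend on the actual microarchitecture.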
Joydeep Ray is presently a second-year graduate
student in the Department of Electrical and Computer Engineering at
Carnegie Mellon University. He received his B.S. from the Indian Institute
of Technology, Kharagpur. His research interests include fault-tolerant
processor architecture and FPGA prototyping of micro-architecture.
In this project, he was supervised by Prof. James C. Hoe and Prof.
Babak Falsafi.