Reliability Through Architecture: Fault-Tolerant Computer Systems for the 21st Century
Tuesday November 25, 2003
Hamerschlag Hall D-210
4:00 pm
Brian Gold
Carnegie Mellon University
Fault-tolerant computer systems have been used for decades in mission-critical
applications. These systems employ a variety of well-studied mechanisms
for detecting permanent and transient faults and recovering from their
effects. Unfortunately, traditional approaches to adding fault-tolerant
capabilities to existing microprocessors typically result in reduced
performance and/or increased design complexity when compared to the original,
non-fault-tolerant processor. To make matters worse, it is widely believed
that the soft error rate in combinational logic will increase exponentially
with continued scaling in feature size and supply voltage. While fault
tolerance was once the concern of a relatively small sector of applications,
an increasing rate of soft errors in combinational logic will force the
general computing industry to address fault-tolerant architectures.
The time has come to revisit conventional fault-tolerant architectures
and examine how we might construct a scalable, fault-tolerant system
that yields performance on par with commodity computers. In this talk,
I will take a detailed look at two papers, ReVive and SafetyNet, that each
propose a scalable approach to system-level reliability and fault tolerance.
Both proposals are hardware mechanisms and require no modifications
to application software.
Brian Gold is a first-year Ph.D. candidate in the Electrical and Computer
Engineering Department of Carnegie Mellon. He received B.S. degrees in
Electrical Engineering and Applied Computational Mathematics in 2001
and the M.S. degree in Computer Engineering in 2003, all from Virginia
Tech. His research interests are in novel computer architectures that
consider design complexity, power, and reliability without sacrificing
performance.