CALCM - Computer Architecture Lab at Carnegie Mellon



Reliability Through Architecture: Fault-Tolerant Computer Systems for the 21st Century

Tuesday November 25, 2003
Hamerschlag Hall D-210
4:00 pm

Brian Gold
Carnegie Mellon University

Fault-tolerant computer systems have been used for decades in mission-critical applications. These systems employ a variety of well-studied mechanisms for detecting permanent and transient faults and recovering from their effects. Unfortunately, traditional approaches to adding fault tolerance to an existing microprocessor typically reduce performance, increase design complexity, or both, relative to the original, non-fault-tolerant processor. To make matters worse, it is widely believed that the soft error rate in combinational logic will increase exponentially with continued scaling of feature size and supply voltage. While fault tolerance was once the concern of a relatively small sector of applications, this rising soft error rate will force the general computing industry to adopt fault-tolerant architectures.

The time has come to revisit conventional fault-tolerant architectures and examine how we might construct a scalable, fault-tolerant system that yields performance on par with commodity computers. In this talk, I will take a detailed look at two papers, ReVive and SafetyNet, each of which proposes a scalable approach to system-level reliability and fault tolerance. Both proposals are hardware mechanisms that require no modifications to application software.

Brian Gold is a first-year Ph.D. candidate in the Electrical and Computer Engineering Department at Carnegie Mellon. He received B.S. degrees in Electrical Engineering and Applied Computational Mathematics in 2001 and an M.S. degree in Computer Engineering in 2003, all from Virginia Tech. His research interests are in novel computer architectures that address design complexity, power, and reliability without sacrificing performance.