CSSI | Center for Silicon System Implementation

Electrical & Computer Engineering | Carnegie Mellon

Friday, October 4, 12:00-1:00 p.m. HH-1112

Subhasish Mitra
Stanford University

Dependable Reconfigurable Computing: Design Diversity and Self Repair

We demonstrate the power of reconfigurable computing in enabling cost-effective implementations of dependable systems. New concurrent error detection techniques based on practical implementations of design diversity are presented. The field reconfigurability of reconfigurable hardware is used to design self-healing systems capable of autonomous recovery from temporary errors and repair of permanent faults. The applicability of these techniques is demonstrated through implementations on commercial reconfigurable hardware platforms.

An error detection scheme based on diverse duplication compares the outputs of two "different" implementations of the same function and indicates an error when a mismatch occurs. The idea behind this technique is derived from the general concept of design diversity. The conventional notion of design diversity is qualitative and relies on "independent" generation of "different" implementations. A metric to quantify design diversity is presented, along with synthesis algorithms to efficiently design systems with error detection based on diverse duplication.
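As a rough illustration of the comparison step only, the following Python sketch models diverse duplication in software. It is not taken from the talk: the function names, the choice of a 3-input majority function, and the stuck-at-0 fault model are all assumptions made for the example, and it does not attempt to show the diversity metric or the synthesis algorithms.

```python
def majority_sop(a, b, c):
    """Majority of three bits, written as a sum of products."""
    return (a & b) | (b & c) | (a & c)


def majority_mux(a, b, c):
    """Same function, structurally different: c selects OR or AND of a, b."""
    return (a | b) if c else (a & b)


def checked(impl_1, impl_2, *inputs):
    """Run both implementations; return (output of impl_1, error flag)."""
    out_1 = impl_1(*inputs)
    out_2 = impl_2(*inputs)
    return out_1, out_1 != out_2


def stuck_at_zero(impl):
    """Hypothetical fault model: the implementation's output is stuck at 0."""
    return lambda *inputs: 0


# Fault-free: the two implementations agree on every input.
for bits in range(8):
    a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    assert checked(majority_sop, majority_mux, a, b, c)[1] is False

# Faulty copy: a mismatch is flagged whenever the correct output is 1.
faulty = stuck_at_zero(majority_sop)
output, error = checked(faulty, majority_mux, 1, 1, 0)
```

In this toy model the comparator raises the error flag only for inputs where the fault actually corrupts the output; for inputs whose correct output is 0, the stuck-at-0 copy still agrees with the healthy one.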

In traditional dependable systems using hardware redundancy, fault tolerance is realized by detecting errors and locating the faulty chip or faulty board (Field Replaceable Unit or FRU) to be replaced by field service engineers. For systems designed using reconfigurable hardware, the FRU is very fine-grained, such as a logic block or a routing resource (e.g., a pass-transistor-based switch or a logic lookup table in Field Programmable Gate Arrays). Thus, in the case of a permanent fault, a cost-effective repair scheme is obtained by using an alternative configuration in which the faulty parts are replaced with originally unused resources.
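The idea of repairing by switching to an alternative configuration can be sketched, very loosely, as remapping a function from a faulty block to a spare one. The Python below is a toy model under stated assumptions: the `repair` helper, the block names, and the dictionary-based placement are hypothetical, and real FPGA repair involves placement, routing, and bitstream generation that this sketch ignores.

```python
def repair(placement, all_blocks, faulty_block):
    """Return a new placement that avoids faulty_block by using a spare block.

    placement maps a function name to the block implementing it (hypothetical
    representation); all_blocks lists every block on the device.
    """
    used = set(placement.values())
    # Spares are blocks that were originally unused and are not the faulty one.
    spares = [b for b in all_blocks if b not in used and b != faulty_block]
    new_placement = dict(placement)
    for function, block in placement.items():
        if block == faulty_block:
            if not spares:
                raise RuntimeError("no spare block available")
            new_placement[function] = spares.pop(0)
    return new_placement


blocks = ["B0", "B1", "B2", "B3"]
placement = {"adder": "B0", "mux": "B1"}
repaired = repair(placement, blocks, "B1")
```

Here the function placed on the faulty block `B1` moves to the first originally unused block, while the rest of the placement is left unchanged.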

A new self-repairing reconfigurable computing architecture based on dual FPGAs with embedded "soft" micro-controllers is presented. This architecture allows the implemented system to recover from temporary errors and repair itself from permanent faults, without external intervention, with minimal impact on system performance and with very high data integrity and availability. These capabilities make the architecture useful for a variety of dependable applications, including unmanned remote applications such as deep space exploration.

Bio
Dr. Subhasish Mitra received his Ph.D. in Electrical Engineering from Stanford University in 2000. He is currently a Staff Engineer at Intel Corporation and a Consulting Assistant Professor in the EE Department at Stanford University. At Intel, Dr. Mitra works on Design for Testability, Reliability and Manufacturability. At Stanford CRC, he supervises Ph.D. students and is currently involved with the Stanford CRC test chip experiment project. Before that, he was the project leader of the Stanford CRC ROAR (Reliability Obtained by Adaptive Reconfiguration) project. Dr. Mitra also provides part-time consulting in various areas of VLSI design and test. During 2000-2001 he consulted at Agilent Technologies on their System Chip Testing project. He spent a summer at Ambit Design Systems (now part of Cadence Design Systems) integrating a special synthesis algorithm that he developed into Ambit's BuildGates tool. Dr. Mitra's research interests include digital testing, fault-tolerant computing, VLSI synthesis and computer architecture. He has published several papers in these areas in leading conferences and journals. He is also an inventor on patents covering VLSI synthesis algorithms, fault-tolerant computing and VLSI test. Dr. Mitra received gold medals for being the top student in the School of Engineering at the undergraduate and M.Tech. levels. Recently, at Intel, he received a recognition award for developing a breakthrough compaction methodology for test cost reduction.