Overview
As information processing and storage have become a key pillar of modern society's infrastructure, server availability and reliability are now critical aspects of computing systems. Unfortunately, even as availability and reliability grow more crucial, it is becoming ever more challenging to design, manufacture, and market reliable server platforms. This project proposes the Total Reliability Using Scalable Servers (TRUSS) architecture, a reliable, available, and serviceable (RAS) hardware platform. By using commodity blade components interconnected through a scalable network and hardware distributed shared memory (DSM), TRUSS offers cost and performance scalability unparalleled by conventional RAS-oriented servers.
Publications

Chip-Level Redundancy in Distributed Shared Memory
Brian T. Gold, Babak Falsafi, and James C. Hoe
IEEE Pacific Rim International Symposium on Dependable Computing (PRDC-09), November 2009, PDF.

Modeling SRAM Failure Rates to Enable Fast, Dense, Low-Power Caches
Jangwoo Kim, Mark McCartney, Ken Mai, and Babak Falsafi
IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE-5), March 2009, PDF.

Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding
Jangwoo Kim, Nikos Hardavellas, Ken Mai, Babak Falsafi, and James C. Hoe
ACM/IEEE International Symposium on Microarchitecture (MICRO-40), December 2007, postscript, PDF.

PAI: A Lightweight Mechanism for Single-Node Memory Recovery in DSM Servers
Jangwoo Kim, Jared C. Smolens, Babak Falsafi, and James C. Hoe
IEEE Pacific Rim International Symposium on Dependable Computing (PRDC-07), December 2007, postscript, PDF.

Mitigating Multi-bit Soft Errors in L1 Caches Using Last Store Prediction
Brian T. Gold, Michael Ferdman, Babak Falsafi, and Ken Mai
Workshop on Architectural Support for Gigascale Integration (ASGI-07), June 2007, PDF.

Detecting Emerging Wearout Faults
Jared C. Smolens, Brian T. Gold, James C. Hoe, Babak Falsafi, and Ken Mai
2007 IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE-3), April 2007, PDF.

Fingerprinting Across On-chip Memory Interconnects
Srinivas Chellappa, Frédéric de Mesmay, Jared C. Smolens, Babak Falsafi, James C. Hoe, and Ken Mai
2007 IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE-3), April 2007, paper (PDF), poster (PDF).

Reunion: Complexity-Effective Multicore Redundancy
Jared C. Smolens, Brian T. Gold, Babak Falsafi, and James C. Hoe
ACM/IEEE International Symposium on Microarchitecture (MICRO-39), December 2006, postscript, PDF.

The Granularity of Soft-Error Containment in Shared-Memory Multiprocessors
Brian T. Gold, Jared C. Smolens, Babak Falsafi, and James C. Hoe
2006 Workshop on System Effects of Logic Soft Errors (SELSE-2), April 2006, paper (PDF), poster (PDF).

TRUSS: A Reliable, Scalable Server Architecture
Brian T. Gold, Jangwoo Kim, Jared C. Smolens, Eric Chung, Vasileios Liaskovitis, Eriko Nurvitadhi, Babak Falsafi, James C. Hoe and Andreas Nowatzyk
IEEE Micro Special Issue: Reliability-Aware Microarchitectures, November-December 2005, postscript, PDF.

Understanding the Performance of Concurrent Error Detecting Superscalar Microarchitectures
Jared C. Smolens, Jangwoo Kim, James C. Hoe, and Babak Falsafi
IEEE International Symposium on Signal Processing and Information Technology, invited paper, December 2005, postscript, PDF.

Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures
Jared C. Smolens, Jangwoo Kim, James C. Hoe, and Babak Falsafi
ACM/IEEE International Symposium on Microarchitecture (MICRO-37), December 2004, postscript, PDF.

Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth
Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI), October 2004, postscript, PDF.
Also appears in IEEE Micro Special Issue: Top Picks from Computer Architecture Conferences, November-December 2004, PDF.

Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery
Joydeep Ray, James Hoe, and Babak Falsafi
ACM/IEEE International Symposium on Microarchitecture (MICRO-34), December 2001, postscript, PDF.

Technical Reports

Tolerating Processor Failures in a Distributed Shared-Memory Multiprocessor
Brian T. Gold, Babak Falsafi, and James C. Hoe
CALCM Technical Report 2006-1, PDF.

People
Links
Sponsors

 National Science Foundation

 Intel Corporation

 MARCO

 Center for Circuit & System Solutions

 Carnegie Mellon CyLab