
Understanding Failure at Scale

Tuesday April 18, 2006
Hamerschlag Hall D-210
4:30 pm

Bianca Schroeder
Carnegie Mellon University

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available, because such data is usually confidential. In our recent work, we analyze soon-to-be-public failure data covering systems at a top-ten high-performance computing site. The data was collected over the past 9 years at Los Alamos National Laboratory and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. In this talk, we will give an overview of our results, including statistics on the root causes of failures, the mean time between failures, and the mean time to repair. We will also show how system parameters such as system size affect these statistics.

This is joint work with Garth Gibson and will appear in DSN'06.

Bianca completed her PhD in August '05 under the guidance of Mor Harchol-Balter and is now a postdoctoral fellow working with Garth Gibson. She is interested in workload and failure data analysis and in how the results can be used to improve system design, in particular scheduling and resource allocation.


Department of Electrical and Computer Engineering
Carnegie Mellon University
School of Computer Science