Understanding Failure at Scale
Tuesday April 18, 2006
Hamerschlag Hall D-210
4:30 pm
Bianca Schroeder
Carnegie Mellon University
Designing highly dependable systems requires a good understanding of
failure characteristics. Unfortunately, little raw data on failures in
large IT installations is publicly available, due to the confidential
nature of this data. In our recent work we analyze soon-to-be-public
failure data covering systems at a top-ten high-performance computing
site. The data has been collected over the past 9 years at Los Alamos
National Laboratory and includes 23,000 failures recorded on more than 20
different systems, mostly large clusters of SMP and NUMA nodes. In this
talk we will give an overview of our results, including statistics on
the root cause of failures, the mean time between failures, and the mean
time to repair, and how system parameters such as system size affect
these statistics.
This is joint work with Garth Gibson and will appear in DSN'06.
Bianca completed her PhD in August '05 under the guidance of Mor
Harchol-Balter, and is now a postdoctoral fellow working with Garth
Gibson. Bianca is interested in workload and failure data analysis and
how the results can be used to improve system design, in particular
scheduling and resource allocation.