DRAM Errors in the Wild: A Large-scale Field Study

Tuesday April 14, 2009
Location TBD
4:00 pm



Prof. Bianca Schroeder
University of Toronto

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this talk, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.

The goal of this talk is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?

Bio

Bianca is currently an assistant professor in the Computer Science Department at the University of Toronto. Before joining UofT, she spent two years as a postdoc at Carnegie Mellon University working with Garth Gibson. She received her doctorate from the Computer Science Department at Carnegie Mellon University in 2005 under the direction of Mor Harchol-Balter. She is a two-time winner of the IBM PhD Fellowship, and her work has won three best paper awards. Her recent work on system reliability has been featured in articles on a number of news sites, including Computerworld, Slashdot, PCWorld, StorageMojo and eWEEK.

Bianca's research focuses on the design and implementation of computer systems. The methods she uses in her work are inspired by a broad array of disciplines, including performance modeling and analysis, workload and fault characterization, machine learning, and scheduling and queuing theory. Her work spans a number of different areas in computer systems, including high-performance computing systems, web servers, computer networks, database systems and storage systems.
