Main memory is one of the leading hardware causes of machine crashes in today's datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of the characteristics of DRAM errors in the field. While there have recently been a few initial studies of DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions about DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, and the typical patterns of hard errors.
In this project, we study data on DRAM errors collected on a diverse range of production systems, covering nearly 300 terabyte-years of main memory in total. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we characterize these hard errors in detail. As a second contribution, we use the results of the measurement study to identify a number of promising directions for designing more resilient systems, and we evaluate the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.
Ioan Stefanovici is a PhD student in the Computer Systems and Networks Group at the University of Toronto, under the supervision of Prof. Bianca Schroeder. His current research deals primarily with improving the performance and reliability of large-scale computer systems running scientific and commercial applications. He received his MSc in 2012 with the same advisor, and his HBSc from the University of Toronto in 2010, completing the Computer Science: Information Security specialist program with a minor in Mathematics. He has also gained industry experience through internships at Microsoft and Google.