
High-Performance Reliability, Availability and Serviceability for Scalable Memory Systems

Tuesday, February 28, 2006
Hamerschlag Hall D-210
4:30 pm



Jangwoo Kim
Carnegie Mellon University

The integrity of the memory subsystem is critical in the server market, where nonstop computers must deliver reliable and correct computation. Machines such as distributed shared-memory (DSM) servers must therefore cope with various sources of errors, such as multi-bit upsets and permanent hardware failures, without losing information or halting execution. Not only must the system recover data from a lost node, but it must also allow on-line replacement and repair of faulty memory. Finally, the memory protection mechanism must have a low performance overhead.

In this talk, I will present DRUM, a physical memory protection mechanism for DSM servers. DRUM uses distributed parity protection to transparently detect and recover from both soft and permanent errors. Even after a permanent error is detected, DRUM continues to serve memory requests and protect memory from further errors. During error-free execution, DRUM minimizes the performance overhead by keeping parity updates off the critical path of memory requests.
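
As a rough illustration of the general idea behind parity-based recovery (not DRUM's actual design, which this abstract does not detail), the sketch below assumes a simple XOR parity group: three hypothetical data nodes and one parity node. The names `xor_blocks` and `update_parity`, the group size, and the block contents are all assumptions made for this example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def update_parity(parity, old_block, new_block):
    """Fold a write into the parity without reading the other nodes."""
    return xor_blocks([parity, old_block, new_block])

# Hypothetical parity group: three data nodes plus one parity block.
data = [b"\x0f\xa0", b"\x33\x0c", b"\x55\x01"]
parity = xor_blocks(data)

# Suppose node 1 is lost: XOR the surviving blocks with the parity
# to rebuild the missing block.
recovered = xor_blocks([data[0], data[2], parity])

# A later write to node 0 only needs the old and new values of that
# block to bring the parity up to date.
new_block = b"\xff\x00"
parity_after_write = update_parity(parity, data[0], new_block)
data_after_write = [new_block, data[1], data[2]]
```

Because the parity update depends only on the old and new contents of the written block, it can in principle be applied after the write itself has completed, which is one way such updates could stay off the critical path.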


Jangwoo Kim is a PhD candidate in the Computer Architecture Lab at Carnegie Mellon, working with Prof. Babak Falsafi. Jangwoo's research interests include fault-tolerant computer architecture and full-system simulation.

 

Department of Electrical and Computer Engineering
Carnegie Mellon University
School of Computer Science