Reliability, Availability and Serviceability for Scalable Memory Systems
Tuesday, February 28, 2006
Hamerschlag Hall D-210
Carnegie Mellon University
The integrity of the memory subsystem is critical in the server market,
where nonstop computers must deliver reliable and correct computation.
Therefore, machines such as distributed shared-memory (DSM) servers must
cope with errors such as multibit upsets and permanent hardware failures
without losing information or halting execution. Not
only must the system recover data from a lost node, but it must also
allow on-line replacement and repair of faulty memory. Finally, the
memory protection mechanism must have a low performance overhead.
In this talk, I will present DRUM, a physical memory protection
mechanism for DSM servers. DRUM uses distributed parity protection to
transparently detect and recover from both soft and permanent errors.
Even after a permanent error is detected, DRUM continues to serve memory
requests and protect memory from further errors. During error-free
execution, DRUM minimizes the performance overhead by keeping parity
updates off the critical path of memory requests.
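
As a rough illustration of the general idea behind parity protection, the
sketch below keeps one XOR parity block over blocks held on different nodes,
rebuilds a lost block from the survivors, and updates parity incrementally on
a write. It is not DRUM's actual design, which this abstract does not
describe in detail. The function names and the one-block-per-node layout are
assumptions made only for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block protecting one block from each node (assumed layout)."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the surviving blocks and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


def update_parity(parity, old_block, new_block):
    """Update parity after a write using only the old and new contents of one block."""
    return xor_blocks([parity, old_block, new_block])


# Hypothetical usage: three nodes, then the block on node 1 is lost.
blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = make_parity(blocks)
rebuilt = recover_block([blocks[0], blocks[2]], parity)
```

Because update_parity needs only the old and new contents of the written
block, a system could in principle apply it after the write completes, which
is one way to picture keeping parity updates off the critical path.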
Jangwoo Kim is a PhD candidate in the Computer Architecture Lab at
Carnegie Mellon, working with Prof. Babak Falsafi. Jangwoo's research
interests include fault-tolerant computer architecture and full system