
RAS Design Practices in Commercial Microprocessors

Tuesday October 3, 2006
Hamerschlag Hall 1112
4:30 pm



Nhon Quach
AMD

As transistor dimensions shrink and devices become more susceptible to soft errors with each technology generation, ensuring the reliability of a commercial microprocessor is becoming a bigger challenge. The challenge is compounded by the trend toward integrating larger caches and more processor cores on a single socket.

This presentation consists of two parts. In the first part, I will introduce the general concepts (soft errors, error rates, reliability vs. availability, etc.) and considerations behind strategies used to implement RAS (reliability, availability, and serviceability) features in current commercial microprocessors. I will examine past design practices in the area of RAS, present the challenges facing current designers, and discuss future trends. I will use examples at the end to tie these concepts together.

In the second part, time permitting, I will delve into two topics that illustrate the concepts presented in the first part: thread-level redundancy (TLR) and data poisoning. TLR is an efficient microarchitectural redundancy technique for protecting a processor against soft errors. The technique shows great potential as a way to combat soft errors in combinational logic and has been widely researched and reported in the literature. However, I will show that the technique, as proposed in the literature, is unlikely to be adopted in commercial microprocessors because of its high verification complexity. Data poisoning is a way to mitigate the impact of double-bit ECC errors in the processor, and I will describe its benefits and implementation cost. I chose this technique because it is relatively new and serves as a good example of the graceful degradation discussed in the first part of the presentation.


Nhon Quach has over 15 years of industry experience in various aspects of microprocessor design at companies including AMD, Oracle, and Intel. He is currently an architect working on the processor-specific features (including RAS) of AMD's next-generation processor. At Oracle, he set hardware and software optimization strategies for various server platforms in the virtual OS server technology group. Prior to joining Oracle, he was the system architect and manager of the Itanium system architecture group at Intel, where he defined the RAS architecture for the first-generation Itanium processor. His Ph.D. research at Stanford on high-speed integer addition, integrated rounding for FP addition, and systematic rounding for FP multiplication has been widely adopted in the FP hardware design community.

 

Department of Electrical and Computer Engineering
Carnegie Mellon University
School of Computer Science